Learning With Style: Continual Semantic Segmentation Across Tasks and Domains

Deep learning models dealing with image understanding in real-world settings must be able to adapt to a wide variety of tasks across different domains. Domain adaptation and class incremental learning deal with domain and task variability separately, whereas their unified solution is still an open problem. We tackle both facets of the problem together, taking into account the semantic shift within both input and label spaces. We start by formally introducing continual learning under task and domain shift. Then, we address the proposed setup by using style transfer techniques to extend knowledge across domains when learning incremental tasks and a robust distillation framework to effectively recollect task knowledge under incremental domain shift. The devised framework (LwS, Learning with Style) is able to generalize incrementally acquired task knowledge across all the domains encountered, proving to be robust against catastrophic forgetting. Extensive experimental evaluation on multiple autonomous driving datasets shows how the proposed method outperforms existing approaches, which prove to be ill-equipped to deal with continual semantic segmentation under both task and domain shift.


INTRODUCTION
With the recent rise of deep learning, the computer vision field has witnessed remarkable advances. Challenging tasks, such as image semantic segmentation, are nowadays successfully addressed by well-established deep learning architectures [1], [2], [3]. Nonetheless, the fundamental problem of continuously learning and adapting to novel environments remains open and is actively investigated, with a long way to go before its definitive solution.
Although capable of remarkable performance in narrow and confined tasks, deep models tend to struggle when confronted with continual learning of dynamic tasks in ever-changing environments. A major issue lies in the tendency to catastrophically forget previously acquired knowledge [4], with new information erasing what was learned so far. Furthermore, variable input distribution between supervised training data and target data has been shown to cause performance degradation, giving rise to the need for domain adaptation, which targets knowledge transferability across domains. Both constitute critical problems when it comes to deploying deep models in practical applications, as in the real world it is very likely to face distribution variability both in terms of input data and of target tasks.
A thriving research endeavour has been devoted to continual learning (also referred to as incremental learning, IL, or lifelong learning [5]) in vision problems, such as image classification [4], [6], [7], object detection [8], [9], [10] and, more recently, semantic segmentation [11], [12], [13]. The majority of those works, however, are limited to a class incremental perspective of the continual learning problem, where the focus is strictly posed on the variable task (e.g., class) supervision and label-space shift experienced throughout the learning process. On the other hand, a significant research effort has been directed toward the domain adaptation problem, ranging from a static learning setting [14], [15], [16] to, quite recently, a dynamic perspective [17], [18], [19], taking into account incremental changes in the data distribution.
Nonetheless, the general continual learning problem across both tasks and domains is yet unexplored for the semantic segmentation task. Whereas class incremental methods usually struggle to cope with domain knowledge transferability, domain incremental methods are ill-suited to address incremental task supervision. We instead propose to tackle continual semantic segmentation with joint incremental shift along class and domain directions. The training process involves multiple steps, each of which carries a new set of classes to learn, along with a training set comprising image samples with a step-distinctive distribution, differing from those experienced in previous steps, and supervision available only on the newly introduced class set. The overall objective is for the incremental segmentation model to deliver satisfactory performance across all the tasks (i.e., class sets) and domains encountered so far, with the class- and domain-wise joint training as the target upper bound.
In this novel problem setup (see Fig. 1), both domain adaptation and recollection of past classes must be performed to achieve satisfactory performance. Under the domain incremental angle, it is required to simultaneously learn new classes over past domains and adapt old-class knowledge to the new domain. From the class incremental perspective, recollection of past knowledge must take into account the variable input distribution characterizing the addressed incremental learning problem.
We therefore devise multiple training objectives to face the underlying sub-problems. While to rehearse knowledge of old classes we resort to the old-step segmentation model, which is a common practice among class incremental learning methods [11], to replay information of past-domain input distribution we propose a stylization mechanism. The average style (i.e., a very compact representation) of each encountered domain is computed and stored in a memory bank, to be transferred to novel domains in future steps and reproduce some domain-level information.
The overall optimization framework is made of (i) a standard task loss (i.e., cross-entropy objective) to learn new classes over available training data, (ii) an additional task loss instance to learn new classes in old domains by leveraging stylization, (iii) a knowledge distillation-like objective to infuse adapted information of past classes in the form of hard pseudo-labels to the new domain and finally (iv) an output-level knowledge distillation objective applied on stylized images to retain old-domain old-class performance.
To summarize, our contributions are as follows:
• We investigate a novel comprehensive incremental learning setting that accounts for variable distribution within both input and label spaces.
• We develop a framework to tackle all facets of the class and domain incremental learning problem, based on a stylization mechanism to recall domain knowledge under incremental task supervision and a robust distillation framework to retain task knowledge under incremental domain shift.
• We devise novel experimental setups to simulate the proposed learning setting and conduct an extensive evaluation campaign.
• We show that the proposed method outperforms existing state-of-the-art methods that address the IL problem only from a class or a domain incremental perspective.

RELATED WORKS
Semantic Segmentation. Under the impulse of deep learning, semantic segmentation has witnessed a considerable advance in recent years [20]. Since the introduction of fully convolutional networks (FCNs) [1], which introduced the popular encoder-decoder architecture, huge research efforts have improved the state of the art. Dilated convolutions [2], [21] allow retaining sufficiently large receptive fields while limiting the growth in model size. Spatial [22] and feature [23] pyramid pooling extract and aggregate contextual information at different scales to acquire enriched representations for improved dense predictions. At the same time, considerable interest was devoted to the design of lightweight architectures for practical applications typically burdened by strict hardware constraints. MobileNet architectures [24], [25] are built upon the efficient depthwise separable convolution. ErfNet [3] resorts to factorized residual layers to provide real-time accurate segmentation. Recently, transformers have been applied to vision, even for dense prediction tasks such as semantic segmentation [26].
Class Incremental Learning (CIL). Continual learning in the form of incremental classification tasks has been the subject of growing research interest in the recent past [5]. Extensive literature can be found targeting image classification [4], [6], [7], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36] and object detection tasks [8], [9], [10], [37] under the incremental learning paradigm. Many of these works [7], [28], [29], [33], [34], [35], [36] rely on exemplars, i.e., a small portion of training data is stored to be replayed in future steps. We instead place ourselves in a totally exemplar-free setup. Among the exemplar-free methods [4], [6], [8], [9], [10], [30], [31], [32], [37] we can identify regularization-based [4], [10], [37], rehearsal-based [6], [8], [9], [30], [31] and structure-based [32] approaches. Even if many works propose techniques which could in principle be generalized to various vision tasks (such as the prosperous knowledge distillation mechanism [6], [8], [38]), when facing the semantic segmentation task additional complexity arises that is not present in the case of whole-image classification or object detection [39]. More limited literature can be found for incremental semantic segmentation [11], [12], [13], [40], [41], [42], even though this field has experienced a very recent rise in research consideration [43], [44], [45], [46], [47]. A first direction of study has been oriented toward the adaptation of the knowledge distillation mechanism to incremental semantic segmentation [11], [12], [13], [40], [43], [44], [47]. Michieli et al. [11], [48] were the first to introduce this technique in CIL for dense classification, proposing both feature- and output-level variants of the distillation objective. In [12] the authors address the semantic shift of background regions by proposing a novel distillation formulation. Furthermore, [13] improves feature-level distillation by pooling representations to capture spatial relationships. Phan et al. [47] introduce a measure of task similarity as a weighting factor in the distillation objective. Yang et al. [44] resort to a structured self-attention approach to preserve relevant knowledge. Finally, [43] extends the popular contrastive learning paradigm to incremental semantic segmentation to improve class discriminability in the feature space. Nonetheless, none of the aforementioned works address the distribution shift that could be present across tasks within the input space. We propose to use a distillation objective which is robust to domain incremental gaps and targets the preservation of old-task knowledge both on the current domain, by distilling through robust hard pseudo-labels, and on the past domains, by leveraging domain stylization to distill knowledge when experiencing old-domain input statistics. Targeting semantic discriminability of latent representations, a clustering-based objective built upon class prototypes is proposed in [42]. Maracani et al. [41] introduce a novel rehearsal approach based on the retrieval of training samples from external sources, i.e., via GAN-based generation or web-crawling. Cermelli et al. [45] further show that it is possible to perform continual training with only image-level annotations in incremental steps and reach high accuracy in some CIL experimental setups. Nonetheless, this approach could be susceptible to the amount of dense supervision provided in the first learning step, and might not scale well to segmentation of images containing objects of different classes. Zhang et al.
[46] devise a dynamic incremental framework to decouple the representation learning of old and new tasks. All the aforementioned works assume statistical homogeneity across learning steps in terms of input data distribution. On the other hand, we address the more realistic setup with both input and label spaces undergoing incremental shifts, and we show the superiority in this generalized setup of the proposed incremental approach compared to pure CIL competitors.
Domain Adaptation (DA). Deep models are known to suffer performance degradation when presented with varying input distribution between training and testing phases [49]. Domain adaptation has been extensively investigated to alleviate the aforementioned problem by safely transferring learned knowledge from label-abundant source domains to label-scarce, or even unsupervised, target ones. Particularly flourishing has been unsupervised domain adaptation (UDA) for the semantic segmentation task [14], [15], [16], [50], [51], [52], as supervision in terms of dense segmentation maps is usually very costly and time-consuming to collect for real-world data. In its standard form, UDA entails no continual learning, as the task at hand is the same on both the source and target static domains, which are concurrently available. We instead address a more realistic setup with dynamic task and domain evolution.
More recently, different variations of the static DA have been proposed, relaxing some of the original strict assumptions. One research direction involves distinct tasks between source and target domains, i.e., allows source and target classes to be different. Depending on the relationship between source and target class sets, partial [53], open-set [54] and universal [55], [56] domain adaptation setups have been proposed, even though most research has been confined to the image classification problem [54], [55], [56]. Moreover, these works do not involve class incremental learning, as adaptation is performed with simultaneous access to source and target domains in a single learning phase.
Another line of works has explored diverse setups in terms of domain availability. Some propose to handle multiple source [57], [58] or target [17], [18], [19], [59], [60], [61], [62] domains. This can involve a single adaptation phase [57], [58], or multiple phases where different domains are experienced in different learning steps in an incremental fashion [17], [18], [19], [61], [62], in fact undertaking continual learning under the domain adaptation perspective. Yet, all these works assume homogeneity of tasks across all the domains encountered, whereas the class and domain incremental setup we propose deals with variable learning conditions along both task and domain progressions. Garg et al. [63] develop a multi-domain incremental learning (MDIL) framework that involves classification tasks shifting across multiple domains experienced in an incremental fashion. However, total supervision is available on all the domains encountered, leading to overlapping incremental class sets. We instead adhere to a stricter CIL setup, with disjoint groups of semantic categories incrementally introduced.
It is possible to find a few works that address both task incremental and domain adaptation problems. Kalb et al. [64] discuss class and domain incremental learning, but each task is tackled individually by evaluating standard CIL and DA methods. In [65] coarse-to-fine continual learning is explored, but the proposed setup does not involve domain shift across learning steps, as source and target domains are kept fixed. Recently, Simon et al. [66] address continual learning with tasks and domains dynamically evolving. Still, they assume to have task supervision on all the considered domains at each task incremental step, which may not be a realistic assumption in real-world applications. In addition, rehearsal of training exemplars is performed, and the method specifically targets image classification.

PROBLEM SETUP
In semantic segmentation we aim at labeling every individual spatial location of an image by associating it with a semantic class taken from a predefined collection of candidates C. That is, given an RGB image X ∈ X ⊂ ℝ^{H×W×3}, a segmentation network S : X → Y is exploited to provide its segmentation map Ŷ ∈ Y ⊂ C^{H×W}. Ŷ should be an accurate prediction of the ground truth map Y, which is available only at training time.
We follow an incremental learning protocol to optimize the segmentation network, as depicted in Fig. 2. Specifically, the predictor is trained in multiple steps t = 0, ..., T − 1 to recognize a progressively increasing set of semantic classes. At step t, a new class set C_t is introduced, along with training data D_t = {(X_t, Y_t)} ⊂ X_t × Y_t associated to that set, which is available on the current image domain X_t. The supervision provided by D_t is restricted to C_t, meaning that any pixel within D_t is tagged in Y_t with c ∈ C_t. At the end of the step, all the currently accessible data is discarded and is not reused again. The procedure is reiterated for multiple learning steps, with a new domain X_t and class set C_t being introduced and used for training at each step. More formally, the objective is to train S_t : X_{0:t} → Y_{0:t}
• to recognize all the semantic classes observed up to the current step t: C_{0:t} = C_0 ∪ ... ∪ C_t;
• on all the image domains experienced so far: X_{0:t} = X_0 ∪ ... ∪ X_t.
We remark that {X_t}_{t=0}^{T−1} are characterized by diverse statistical properties, i.e., domain shift occurs between them,

typically manifested through cross-domain variable visual appearance of scene elements that yet share semantic significance. All C_t are disjoint sets, except for the unknown (u) class, which belongs to each of them. Class u at step t contains all the past and future classes. In other words, u undergoes a semantic shift across subsequent steps and, for this reason, demands special care when being handled [12].
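To make the protocol concrete, the step structure can be sketched in a few lines of Python. This is a toy illustration under our own assumptions: the class ids, per-step contents, and the unknown-as-0 labeling convention are illustrative, not the paper's actual splits.

```python
# Toy sketch of the class- and domain-incremental protocol: each step t brings
# a new domain X_t and a disjoint class set C_t; the unknown class u (id 0 here,
# an assumed convention) belongs to every step's class set.
UNKNOWN = 0

steps = [
    {"domain": "cityscapes", "classes": {1, 2, 3}},  # step 0
    {"domain": "bdd100k", "classes": {4, 5}},        # step 1
    {"domain": "idd", "classes": {6, 7, 8}},         # step 2
]

def seen_classes(steps, t):
    """C_{0:t}: all classes the model must recognize after step t, plus unknown."""
    seen = {UNKNOWN}
    for step in steps[: t + 1]:
        # class sets introduced at different steps must be disjoint
        assert step["classes"].isdisjoint(seen - {UNKNOWN})
        seen |= step["classes"]
    return seen
```

At test time, the model trained up to step t is evaluated on every domain in `steps[:t+1]` over all of `seen_classes(steps, t)`.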

OVERVIEW OF THE PROPOSED METHOD
We concurrently face challenges peculiar to both the domain adaptation and the class incremental learning settings.
Domain Adaptation. The segmentation network is trained on data from multiple domains, each holding only a subset of the whole set of semantic classes. Even so, the model is expected to provide satisfactory prediction performance on all the observed domains and semantic classes. Hence, it is necessary to transfer knowledge across domains to (i) learn new-class clues shared across the current (supervised) domain and the past ones (where new-class supervision was not available during past steps); (ii) adapt old-class knowledge learned in former domains to the novel domain.
Class Incremental Learning. The different class supervision available on different domains leads us to a class incremental problem, where semantic categories come across in a continual fashion. Therefore, we are required to address the widely known catastrophic forgetting phenomenon [4], aiming at preserving knowledge from past classes when learning new ones. However, unlike standard CIL, knowledge preservation has to be performed differently depending on the domain in which it is applied: (i) in past domains, straightforward recollection of previously observed classes can be imposed, as those classes were learned over past domain distributions; (ii) in the novel domain, instead, recalled memory of past classes should be adapted to account for the semantic shift happening within the input space.
We break down the domain shift and class continual learning problems into simpler underlying sub-problems, as indicated above. Our overall learning framework builds upon multiple individual objectives, each focusing on a specific sub-problem: (i) learn new classes over new domains (Sec. 5.1); (ii) learn new classes over past domains (Sec. 5.2); (iii) adapt old classes to new domains (Sec. 5.3); (iv) preserve old-class information in old domains (Sec. 5.4).

Domain Stylization
We resort to a style transfer mechanism to recreate image data with statistical properties resembling those of past domains.More specifically, starting from the available image data originating from the input domain accessible at the current step, we transfer the styles extracted from all the previously encountered domains.By doing so, a stylized version of each of the former domains is produced, with image content derived from the novel dataset.
The benefits that originate from domain stylization are manifold: (i) We force the prediction model to experience past input distributions under supervision or pseudo-supervision, tackling domain-level catastrophic forgetting. (ii) We aim at learning new classes on old domains, where supervision was not available when they were directly observed. At the same time, we propose to preserve old-class knowledge on old domains, counteracting class-level catastrophic forgetting. (iii) By encountering a variegated input distribution, the predictor is encouraged to develop the ability to generalize to unseen domains, which is crucial in a continual learning paradigm that involves domain shift.
The style transfer mechanism we adopt is inspired by [16] and involves low computational cost and memory requirements. The original algorithm works in the Fourier transform domain: the low-frequency portion of the amplitude of the spectral representation from a target image (i.e., the style) is extracted and applied to replace that of a source image (i.e., the content), whose phase component is kept unchanged. The outcome is image data with source semantic information, and target-like low-level appearance.
We enhance the original method to accommodate the further complexity brought in by the class and domain incremental setting. From each image of the currently available dataset, we extract its style tensor (i.e., the amplitude central window), and we average it over all the samples:

F̄ᴬ_t = (1 / |D_t|) Σ_{X ∈ D_t} W_β(F_A(X)),

where F_A(X) is the amplitude obtained by the FFT applied to image X, and W_β is the style window. By doing so, we are extracting significant knowledge of domain-dependent statistical properties, condensed in a compact representation. The domain-specific style F̄ᴬ_t of step t is stored in an incrementally-filled memory bank M^F_{0:t−1} = { F̄ᴬ_k | k < t } and preserved across steps. By leveraging the proposed storage mechanism, at each incremental step we can access crucial information of past-domain low-level properties (yet minimal if compared to that contained in whole training sets), without requiring direct access to raw image data, which would violate the exemplar-free assumption. We stress that domain shift affects low-level details, while high-level semantic content is mostly shared across domains (e.g., the road serves the same purpose regardless of the dataset, while its appearance in terms of texture or pavement material might vary considerably). To create an oldly-stylized dataset at step t looking back at step k < t (i.e., X̃ᵗ_k), for each image of the current domain we replace its amplitude window with that of the selected former domain as follows:

X̃ = F⁻¹( F̃_A(X), F_P(X) ),

where F̃_A(X) is obtained from F_A(X) by substituting its central window W_β(F_A(X)) with the stored style F̄ᴬ_k, F⁻¹ is the inverse FFT operator, and F_P(X) is the Fourier phase component of X. In addition, we devise a self-stylization mechanism, by applying to each domain its own average style, to improve generalization toward future steps, promoting forward transfer. As for the dimension of the style window, we experimentally found that the β parameter as defined in [16] (i.e., the parameter controlling the window size) provides satisfactory and robust results when set to 1e−2.
Finally, we stress that our approach is independent of the style transfer technique used, provided that style information and content can be extracted in two distinct steps.
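As a rough illustration, the Fourier-based style extraction and transfer can be sketched with NumPy as follows. This is a minimal single-image sketch under our own assumptions about windowing and normalization; it is not the authors' released implementation.

```python
import numpy as np

def extract_style(img, beta=0.01):
    """Return the low-frequency amplitude window (the 'style') of an RGB image."""
    amp = np.abs(np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1)))
    h, w = img.shape[:2]
    bh, bw = max(1, int(h * beta)), max(1, int(w * beta))
    cy, cx = h // 2, w // 2
    return amp[cy - bh:cy + bh, cx - bw:cx + bw].copy()

def apply_style(img, style):
    """Replace the central amplitude window of `img` with a stored (average) style.

    Assumes `style` was extracted from an image of the same resolution, so the
    window sizes match. Phase is kept, so semantic content is preserved.
    """
    fft = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    amp, pha = np.abs(fft), np.angle(fft)
    h, w = img.shape[:2]
    bh, bw = style.shape[0] // 2, style.shape[1] // 2
    cy, cx = h // 2, w // 2
    amp[cy - bh:cy + bh, cx - bw:cx + bw] = style
    fft_new = np.fft.ifftshift(amp * np.exp(1j * pha), axes=(0, 1))
    return np.real(np.fft.ifft2(fft_new, axes=(0, 1)))
```

In the full method, the `extract_style` outputs would be averaged over the whole step dataset and the resulting mean style stored in the memory bank; averaging requires a fixed window size across images.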

Learning New Classes over New Domains
In the proposed class and domain continual learning framework, direct supervision is available only for the newly introduced class set C_t and image domain X_t, in the form of the training dataset D_t ⊂ X_t × Y_t. As mentioned before, image pixels not belonging to C_t, i.e., of past or never-seen classes, are assigned to a special unknown class, whose semantic statistical properties are highly dynamic.
To account for the semantic shift suffered by the unknown class at the current step t > 0 w.r.t. previous steps, we group the past and unknown class probability channels as follows:

P̃_t(X)[x, y, c] = Σ_{k ∈ C_{0:t−1}} P_t(X)[x, y, k]  if c = u,   P_t(X)[x, y, c]  otherwise,

where P_t(X) ∈ ℝ^{H×W×|C_{0:t}|} is the output of S_t prior to the argmax when a generic image X ∈ X is given as input.
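In code, this grouping amounts to summing probability mass over channels. The following is a schematic NumPy sketch; the channel layout with the unknown class first is our assumption, not the paper's convention.

```python
import numpy as np

def group_past_into_unknown(probs, past_idx, unknown_idx=0):
    """Merge past-class probability channels into the unknown channel.

    probs: (C, H, W) softmax output of the current model; past_idx: channel
    indices of old classes. Returns a (C', H, W) map whose first channel is
    the grouped unknown, followed by the remaining (new-class) channels.
    """
    merged_u = probs[unknown_idx] + probs[list(past_idx)].sum(axis=0)
    keep = [c for c in range(probs.shape[0])
            if c != unknown_idx and c not in set(past_idx)]
    # the result is still a valid distribution: channels sum to 1 per pixel
    return np.concatenate([merged_u[None], probs[keep]], axis=0)
```

The symmetric grouping of new classes into the unknown channel (used later for distillation) follows the same pattern with the roles of old and new indices swapped.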
We additionally define D̃ᵗ_t ⊂ X̃ᵗ_t × Y_t as the self-stylized training dataset at step t, where the average style (defined above in Sec. 4) of the current image domain has been applied on top of the X_t domain itself.
To learn the newly introduced classes over the new domain we optimize:

ℒñ_ce = − (1 / HW) Σ_{(x,y)} log P̃_t(X̃)[x, y, Y[x, y]],  (X̃, Y) ∈ D̃ᵗ_t,    (6)

where we leverage input data with the current style and supervision over the new class set. The ñ superscript indicates the use of self-stylized data on the new domain. The purpose of self-stylization is twofold: first, it provides additional robustness and generalization capability to the prediction model, since input data is supplied with more homogeneous low-level statistics across individual samples. Second, it forces the prediction model to experience domain statistics that will be stored and replayed in the future, acting as proxies for the no-longer-available previous domain statistics.

Learning New Classes over Past Domains
To compensate for the lack of available input data for past domains, we generate proxy datasets retaining low-level statistics resembling those of past domains. More precisely, for each stored style F̄ᴬ_k, k < t, we build D̃ᵗ_k ⊂ X̃ᵗ_k × Y_t (as detailed in Sec. 4), i.e., an oldly-stylized training dataset at step t, for which the domain-specific visual attributes of step k < t have been applied on domain X_t.
Supervision on the newly introduced classes over the old domains is exploited by optimizing:

ℒõ_ce = − (1 / HW) Σ_{(x,y)} log P̃_t(X̃)[x, y, Y[x, y]],  (X̃, Y) ∈ D̃ᵗ_k, k < t,

where we leverage input data with past styles (i.e., with distributions supposedly close¹ to those of the no longer available former domains) and the supervision over the new class set.
The superscript õ indicates the use of oldly-stylized data. By concurrently learning the segmentation task at the present step over an augmented pool of input data distributions from the past, the prediction model should learn more general and shareable clues, overcoming the domain shift inherent in the domain continual learning paradigm.

Adapting Old Classes to New Domains
In the addressed class incremental learning scenario, at each new learning step all past class sets are assumed to lack any direct supervision. To recall previously acquired knowledge, we resort to the well-known knowledge distillation objective [38]. Yet, differently from the standard class incremental learning problem as traditionally formalized in the literature [7], we expect to encounter additional challenges: (i) the input data of past domains (i.e., experienced by the segmentation model when previous class sets were learned) are no longer available; (ii) a distribution shift separates the current image data from that available at former steps. Thus, we no longer have access to data distributed as that experienced by the segmentation model saved from the past step, which, in principle, should be leveraged to distill knowledge of old classes.
To replicate the image distribution of data of past steps, we resort to the stylization mechanism (Sec. 4). Specifically, for each old domain X_k, k < t, we build an oldly-stylized dataset D̃ᵗ_k starting from that of the current step t. To access a form of supervision over the past classes we make use of pseudo-labeling via the prediction model from the previous step, which should retain profitable knowledge of the semantic categories learned so far. However, said model might not distill knowledge effectively when fed with input data of an unseen distribution, i.e., originating from the newly introduced domain. Therefore, we exploit oldly-stylized data to enhance pseudo-labeling by mitigating domain shift. We denote with P^k_{t−1}(X̃) ∈ ℝ^{H×W×|C_{0:t−1}|}, X̃ ∈ X̃ᵗ_k, the classification probability map from model S_{t−1} over new-domain images with the style of step k. We then compute pseudo-labels following:

Ŷᴷ_{t−1}[x, y] = argmax ( max_{k∈K} P^k_{t−1}(X̃)[x, y] ),    (8)

where we leverage old-model predictions over past styles, i.e., we set K = {0, ..., t − 1}, while max_{k∈K} P^k_{t−1}(X̃)[x, y] indicates that for each spatial location (x, y) we take the probability vector associated to the style with maximum peak value. We then refine the generated pseudo-labels at each spatial location (we will shorten Ŷ^{K={0,...,t−1}}_{t−1} as Ŷ^{<t}_{t−1} and drop the term [x, y] for ease of notation) as:

Ŷ = Ŷ^{<t}_{t−1}  if the prediction is confident and Y_t = u,   Ŷ = u  otherwise,

where Y_t ∈ Y_t. The hard pseudo-label Ŷ^{<t}_{t−1}[x, y] (i.e., after the argmax operation in Eq.
(8)) is considered to provide a confident prediction if the peak probability value (of the probability map prior to the argmax) is bigger than a threshold τ, or if that value is among the top-K fraction of highest peaks for class c = Ŷ^{<t}_{t−1}[x, y]. We set τ = 0.9 and K = 0.66 as advised in [16]. In addition, we leverage the ground-truth supervision on new classes to correct noisy estimations in pseudo-labels, by marking as unknown (i.e., u) all the pixels of newly introduced categories. We remark that the employed knowledge distillation is designed to provide insight on previous tasks (where current new classes were assigned to the u class), whereas we entrust Eq. (6) to instill understanding of the novel task. We experimentally verify that using separate objectives to train on new and old classes leads to improved results, as it forces the model to learn to better discriminate between different incremental class sets, part of which might coexist under the same unknown group for one or more learning steps. This is especially true for autonomous driving datasets, where each image can contain several semantically diverse elements, for all of which we may not have supervision from the start of the training.
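The style-wise fusion and confidence filtering described above can be sketched as follows. This is an illustrative NumPy approximation of the selection rule: the per-class top-K filter implemented via a quantile is our simplification, while the threshold values follow those reported in the text.

```python
import numpy as np

def fuse_pseudo_labels(prob_maps, tau=0.9, top_frac=0.66, unknown=0):
    """Fuse old-model predictions over multiple past styles into hard pseudo-labels.

    prob_maps: (S, C, H, W) softmax outputs of the old model, one per stored
    style. Per pixel, keep the style with the highest peak probability; accept
    its label if the confidence exceeds tau or lies in the top fraction of
    confidences for that class, otherwise mark the pixel as unknown.
    """
    peaks = prob_maps.max(axis=1)            # (S, H, W) peak confidence per style
    best_style = peaks.argmax(axis=0)        # (H, W) winning style index
    hh, ww = np.indices(best_style.shape)
    conf = peaks[best_style, hh, ww]                             # (H, W)
    labels = prob_maps.argmax(axis=1)[best_style, hh, ww]        # (H, W)

    accepted = conf > tau
    for c in np.unique(labels):
        mask = labels == c
        # keep the top `top_frac` most confident pixels of class c
        thr = np.quantile(conf[mask], 1.0 - top_frac)
        accepted |= mask & (conf >= thr)
    return np.where(accepted, labels, unknown)
```

In the full pipeline, pixels carrying new-class ground truth would additionally be overwritten with the unknown label before using the result for distillation.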
To infuse adapted information about past classes at the current step without direct access to ground-truth information, we resort to the following objective:

ℒ_pl = − (1 / HW) Σ_{(x,y)} log *P_t(X)[x, y, Ŷ^{<t}_{t−1}[x, y]],  X ∈ X_t,

by which we distill knowledge of past tasks (i.e., recognition of classes in C_{0:t−1}) over the new domain X_t via the pseudo-labels derived from the old model S_{t−1}.
To account for the semantic shift suffered by the unknown class of step t−1 when moving to a new step t > 0, we group the new and unknown class probability channels as follows:

*P_t(X)[x, y, c] = Σ_{k ∈ C_t} P_t(X)[x, y, k]  if c = u,   P_t(X)[x, y, c]  otherwise,    (11)

where *P_t(X) ∈ ℝ^{H×W×|C_{0:t−1}|}. We opt for the use of hard labels in place of the more common soft labels in the distillation-like loss in order to prevent enforcing an uncertain behavior on S_t. This behavior could originate from the mismatch between training and inference input distributions undergone by the old model S_{t−1}, which has been trained over past domains and is now fed with new-domain data (the oldly-stylizing operation reduces domain shift but offers no guarantee of its complete removal). Experimental data on the pseudo-labeling strategy is provided in Sec. 8.2.

Preserving Old Classes on Old Domains
In Sec. 5.3 we focused on distilling old-task knowledge on the current novel domain. Nonetheless, our ultimate target is to end up with a segmentation network capable of recognizing all the observed classes over all the experienced domains, that is, a prediction model robust to both domain and label distribution shifts. For this reason, at every novel incremental step it is required to preserve the task knowledge acquired in the past, that is, on past classes over past domains. To do so, we leverage the output-level knowledge distillation objective in its standard formulation [38], where we force a student model (i.e., the current model) to mimic the predicted classification probability distribution of a teacher model (i.e., the model saved and kept frozen since the end of the previous step). We opted for the objective in its standard fashion [38], as both image and label distributions ideally originate from previous steps, so no domain shift should, in principle, affect the distillation process. In practice, we cannot access former incremental datasets. Therefore, to retrieve the missing old-domain data, we resort once more to stylization (Sec. 4), so that we can leverage oldly-stylized data as a proxy for the missing original images. The final objective is of the following form:

ℒ_kd = − (1 / HW) Σ_{(x,y)} Σ_{c ∈ C_{0:t−1}} P_{t−1}(X̃)[x, y, c] log *P_t(X̃)[x, y, c],  X̃ ∈ X̃ᵗ_k, k < t,

where *P_t(X̃) ∈ ℝ^{H×W×|C_{0:t−1}|} refers to the modified probability distribution from Eq. (11), for which new and unknown categories are incorporated into a single output channel to address the label shift within the u class.
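A bare-bones version of the output-level distillation term can be written in plain NumPy. This is a per-image sketch with unit temperature, as in the standard formulation; the logit shapes and naming are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def output_kd_loss(student_logits, teacher_logits):
    """Cross-entropy between the frozen teacher's output distribution and the
    student's, averaged over spatial locations. logits: (C, H, W)."""
    p_teacher = softmax(teacher_logits, axis=0)
    log_p_student = np.log(softmax(student_logits, axis=0) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=0).mean())
```

In the full objective this term would be evaluated on oldly-stylized images, with the student's new and unknown channels grouped beforehand so both distributions live over the old class set.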
The overall objective is given by the sum of the four introduced terms, i.e., the two cross-entropy objectives on self- and oldly-stylized data, the pseudo-label distillation term, and the output-level knowledge distillation term:

ℒ = ℒñ_ce + ℒõ_ce + ℒ_pl + ℒ_kd.

EXPERIMENTAL SETUP
In this section we provide a detailed description of the experimental setup utilized to validate the proposed framework against multiple competing methods. In Secs. 7 and 8 we report the results of the evaluation campaign and extensive ablation studies as additional support.

Datasets
To simulate the distribution shift at the input (image) level, we make use of multiple driving datasets, each limited to a specific geographic region or set of environmental factors, and thus characterized by a distinctive low-level appearance (e.g., road pavement material, type of vehicles, light conditions). On the contrary, the high-level semantic content is mostly consistent across image sets: road-related categories, as well as moving and static obstacles, can be found everywhere and follow similar inter-class structural relations (e.g., the sky always appears above the road).
Cityscapes. The Cityscapes (CS) dataset [67] is a popular benchmark for autonomous driving applications. Images are collected across 50 cities, all located in Central Europe.
BDD100K. The Berkeley DeepDrive (BDD) dataset [68] is a more diverse collection of road scenes, captured under variable weather conditions at different times of the day. Still, all samples come from 4 restricted localities in the US.
IDD. The Indian Driving Dataset (IDD) [69] includes driving scenes from Indian cities and their outskirts. It offers a diversified set of moving and static road obstacles, as well as a wilder and more natural environment, which breaks away from the typical European or American urban scenarios.
Mapillary Vistas. The Mapillary Vistas dataset [70] contains images collected worldwide, with highly diverse acquisition settings and locations. Unlike the previously introduced benchmarks, samples are not limited to a few cities within fairly uniform geographic regions. We leverage Mapillary to generate continent-wise data splits, as well as to test the domain generalization potential of the proposed class and domain incremental approach.
Shift. The Shift benchmark [71] is a synthetic dataset for autonomous driving, designed to provide a plethora of distribution shifts, simulating the highly variable environmental conditions faced in real-world applications. We exploit it to mimic domain shift due to environmental diversity.
For the BDD, IDD and Mapillary datasets, only the 19 classes available in Cityscapes were used. For Shift, we considered all of the available 22 semantic categories.

Incremental Learning Setup
Domain Incremental Setup. The first domain incremental setup is created by experiencing the CS, BDD and IDD datasets in succession (in different orders) during 3 separate learning steps. Additionally, we propose a further setup where domain shift across learning steps is achieved by splitting the entire Mapillary dataset into incremental sets based on the geographic proximity of samples: 6 separate data subsets are generated, grouping together pictures taken on the same continent. Finally, we leverage Shift to simulate incrementally variable environmental conditions, by partitioning the whole dataset into 3 groups of samples according to light conditions (i.e., daytime, twilight and night).
Class Incremental Setup. We start by following [40] to identify 3 separate groups within the 19 Cityscapes classes, i.e., (i) background regions, (ii) moving elements and (iii) static elements, which are observed incrementally under various arrangements. Then, we extend the aforementioned 3-way class splitting to Shift in a fashion similar to [40], this time on the 22 classes offered by the synthetic benchmark. All the class incremental sets are detailed in Table 2.
By merging the individual class and domain settings, we devise the class and domain incremental setups reported in Table 3. The first (i.e., urban) is generated using the CS, BDD and IDD datasets, together with the 3-way class split from [40]. Formally, we set the total number of learning steps T = 3, and at each step 0 ≤ t < T a new dataset and a new class split are observed, each appearing exactly once (Eq. 14). We further propose an incremental setup (i.e., worldwide) based on the continent-wise splitting of the Mapillary dataset. To match the increase of the domain set size to 6 elements, we divide each class group of [40] in half, for a total of 6 class splits (Table 2). We set T = 6, and at each step 0 ≤ t < T each class set and each domain appears only in a single step (Eq. 15). Among the large number of possible incremental sequences, we perform the experimental evaluation following the EU → NA → AS → OC → AF → SA order. Finally, the last setup (i.e., environmental) combines the environmental partitioning chosen for Shift with the 3-way class splitting from [40].
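For concreteness, the urban setup can be encoded as an ordered sequence of (domain, class set) pairs, with each pair observed exactly once. The class names below are an illustrative subset, not the exact splits of Table 2:

```python
# Illustrative encoding of the "urban" class and domain incremental setup:
# at step t the model sees images of domain X_t annotated only on C_t.
domains = ["Cityscapes", "BDD", "IDD"]                        # X_0, X_1, X_2
class_splits = [
    ["road", "sidewalk", "building", "sky"],                  # C_bgr (subset)
    ["pole", "traffic light", "traffic sign", "wall"],        # C_stat (subset)
    ["person", "rider", "car", "bicycle"],                    # C_mov (subset)
]

steps = [{"t": t, "domain": d, "classes": c}
         for t, (d, c) in enumerate(zip(domains, class_splits))]

seen = []
for s in steps:
    seen += s["classes"]  # class set accumulated over the incremental steps
    print(f"step {s['t']}: train on {s['domain']}, "
          f"{len(s['classes'])} new classes, {len(seen)} seen so far")
```

At test time, performance is measured over all the accumulated classes on all the domains experienced so far.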

Implementation Details
We built our framework in PyTorch. Due to the complexity of the investigated problem, in most experiments we use a lightweight segmentation model, i.e., ErfNet [3]. We argue that a smaller network complies more realistically with the deployment-related constraints of real-world applications, e.g., in terms of memory occupation and inference speed. Yet, for comparison purposes, we report additional results with the heavier and better performing DeeplabV3 architecture [72] with ResNet101 backbone [73]. In all experiments, the segmentation model is pre-trained on ImageNet [74].
With ErfNet, we use the Adam optimizer [75] with learning rate 5e−4. With DeeplabV3, we use the SGD optimizer with learning rate 1e−3. Weight decay is fixed to 1e−4, and we employ a polynomial decay of power 0.9 for learning rate scheduling. We train for 100 and 50 epochs at each learning step with ErfNet and DeeplabV3, respectively (except on Shift, where we set the number of epochs to 10). With ErfNet we use a batch size of 6; with DeeplabV3 we reduce it to 2 due to GPU memory constraints.
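The optimization recipe above can be sketched as follows. This is a sketch: `max_iters` and the per-step scheduler granularity are our assumptions, since the text does not specify whether the polynomial decay is applied per epoch or per iteration.

```python
import torch

def build_optimization(model, arch="erfnet", max_iters=10_000):
    """Optimizer and polynomial LR schedule described in the text:
    Adam (lr 5e-4) for ErfNet, SGD (lr 1e-3) for DeeplabV3,
    weight decay 1e-4, polynomial decay of power 0.9."""
    if arch == "erfnet":
        opt = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-4)
    else:
        opt = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lr_lambda=lambda it: (1.0 - it / max_iters) ** 0.9)
    return opt, sched
```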
When evaluating on the Cityscapes-BDD-IDD and Shift setups, images are resized to 512 × 1024 resolution. When using Mapillary for training, inputs are first resized to a width of 1024 (with fixed aspect ratio) and then cropped to 512 × 1024. This pre-processing accommodates the highly variable aspect ratios of Mapillary's samples.
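The resizing policy can be sketched with plain tensor interpolation; center cropping is our assumption, as the exact crop strategy is not specified in the text.

```python
import torch
import torch.nn.functional as F

def preprocess(img, target=(512, 1024), mapillary=False):
    """img: (C, H, W) float tensor. CS/BDD/IDD/Shift samples are resized
    to 512x1024; Mapillary samples are first resized to width 1024 with
    fixed aspect ratio, then cropped to 512x1024 (center crop assumed)."""
    c, h, w = img.shape
    if not mapillary:
        return F.interpolate(img[None], size=target, mode="bilinear",
                             align_corners=False)[0]
    new_w = 1024
    new_h = int(round(h * new_w / w))  # preserve the aspect ratio
    img = F.interpolate(img[None], size=(new_h, new_w), mode="bilinear",
                        align_corners=False)[0]
    top = max(0, (new_h - target[0]) // 2)
    return img[:, top:top + target[0], :target[1]]
```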
The β parameter controlling the size of the style window is empirically set to 1e−2 and fixed in all experiments. In addition, we experimentally fix λ^õ_ce = λ^ñ_kd = λ^õ_kd = 10, and keep them unchanged in every incremental setup. This shows that our approach is robust to changes of experimental setting and requires minimal hyper-parameter tuning. Ablation studies on the impact of β and the loss weights are in Sec. 8.
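The paper's stylization mechanism (Sec. 4) is not reproduced here; as a hedged illustration of a β-controlled style window, the sketch below swaps low-frequency Fourier amplitudes between a content and a style image, in the spirit of Fourier-based style transfer. The exact correspondence with the operation of Sec. 4 is an assumption.

```python
import torch

def stylize(content, style, beta=1e-2):
    """One possible realization of window-based style transfer: replace
    the low-frequency amplitude of `content` with that of `style` inside
    a centered window whose relative size is controlled by beta.
    content, style: (C, H, W) float tensors of the same shape."""
    fc = torch.fft.fft2(content)
    fs = torch.fft.fft2(style)
    pha_c = fc.angle()
    # shift zero frequency to the center to carve out a low-freq window
    amp_c = torch.fft.fftshift(fc.abs(), dim=(-2, -1))
    amp_s = torch.fft.fftshift(fs.abs(), dim=(-2, -1))
    _, h, w = content.shape
    bh, bw = max(1, int(beta * h)), max(1, int(beta * w))
    ch, cw = h // 2, w // 2
    amp_c[:, ch - bh:ch + bh, cw - bw:cw + bw] = \
        amp_s[:, ch - bh:ch + bh, cw - bw:cw + bw]
    amp_c = torch.fft.ifftshift(amp_c, dim=(-2, -1))
    # recombine swapped amplitude with the original phase
    return torch.fft.ifft2(amp_c * torch.exp(1j * pha_c)).real
```

With a small β only coarse style statistics are transferred; raising β transfers more of the style spectrum but risks visual artifacts, which matches the ablation discussed in Sec. 8.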

Competitors
To the best of our knowledge, this is the first work explicitly modeling and addressing class and domain incremental learning in semantic segmentation.For this reason, we compare with other methods targeting class (CIL) or domain (DIL) incremental learning as individual problems.
Among class-incremental methods, we consider ILT [11] and MiB [12], along with the state-of-the-art PLOP [13] and UCD [43]. When using PLOP with ErfNet, we apply the LocalPOD loss [13] on embeddings extracted at the end of the first and second blocks, as well as at the output of the encoder. For UCD, we modify the contrastive distillation loss so that the maximum number of positives and negatives is set to 3000 each (randomly selected from the whole sets as defined in the original work). We perform this adjustment to meet GPU memory limitations. All experiments were performed on an RTX Titan GPU with 24 GB of memory. We believe that a fair comparison should involve comparable GPU resources for all the competitors.
On the domain-incremental side, we compare with MDIL [63]. Differently from our setup, it assumes full task supervision on all the domains incrementally encountered. We adapt the framework to a class-incremental setup by replacing the standard cross-entropy loss with the unbiased version from [12], to prevent the background shift from erasing the task knowledge learned in past steps.

Metrics
Inspired by [63], to provide a valuable measure of prediction performance across multiple tasks and domains, we resort to a domain-averaged relative performance w.r.t. a fully-supervised oracle reference (the smaller the better), defined at any step t as: Δ̄_t = (1/|X^{0:t}|) Σ_{k=0}^{t} (A^{C^{0:t}}_{X_k}|_{S*} − A^{C^{0:t}}_{X_k}|_{S_t}) / A^{C^{0:t}}_{X_k}|_{S*} (16), where A^{C}_{X}|_S is the class-average accuracy (we make use of the commonly employed mIoU metric [20]) attained by segmentation network S on domain X and class set C. S* is the oracle segmentation model, i.e., trained with full supervision on the entire pool of classes and domains (including classes and domains that will only be observed after step t).
We further provide a measure of generalization aptitude (the higher the better), expressed as the accuracy (in terms of mIoU) achieved over the entire class set observed so far on a novel dataset never experienced before. At step t, the metric is: Γ^{gen}_t = A^{C^{0:t}}_{X_ext}|_{S_t} (17), where X_ext is the unseen domain.
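Under our reading of the two metrics, they can be computed as follows. This is a sketch: the exact normalization of Δ̄ is reconstructed from the text, so treat it as an assumption.

```python
import numpy as np

def miou(conf):
    """Mean IoU from a confusion matrix (rows: ground truth, cols: prediction)."""
    inter = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0  # skip classes absent from both GT and predictions
    return float((inter[valid] / union[valid]).mean())

def delta_bar(acc_model, acc_oracle):
    """Domain-averaged relative gap w.r.t. the oracle (lower is better).
    acc_model[k], acc_oracle[k]: mIoU of the incremental model and of the
    oracle on domain X_k, over the classes seen so far."""
    gaps = [(o - m) / o for m, o in zip(acc_model, acc_oracle)]
    return 100.0 * sum(gaps) / len(gaps)
```

The generalization score Γ^gen is then simply `miou` evaluated on a held-out domain never used for training.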

Evaluation on Urban Scenes
The first experimental setup we explore entails incrementally transitioning between urban and suburban areas of different regions around the world. High- and low-level image contents undergo distribution shifts of different extent: although it might be reasonable to assume that the basic semantic structure of road images is invariant to geographic location, scene elements are likely to change appearance significantly when travelling around the world.

Study on Domain Ordering
To reproduce class and domain distribution shifts, we train on the Cityscapes, BDD and IDD datasets in an incremental fashion. The class incremental protocol is the one proposed in [40] (i.e., C_bgr → C_stat → C_mov). As detailed in Sec. 6.2, we define a total of 3 learning steps. In Tables 4, 5 and 6 we report experimental results following 3 different dataset orders, so that, considering all experiments performed, each dataset is viewed at all 3 possible learning steps. We report results in terms of mIoU computed over all classes excluding the unknown one, as typically done in the literature. The mIoU is computed for each domain X_k (i.e., dataset) experienced up to the current step t (i.e., mIoU^k_t, k ≤ t), ∀t < T. In addition, we provide a measure of relative performance w.r.t. a supervised reference, both for individual domains (Δ^k_t) and as a global quantity (Δ̄_t, Eq. 16). The supervised reference, denoted as Oracle, corresponds to joint training over both class sets and domains.
We compare with methods addressing class incremental learning (ILT [11], MiB [12], PLOP [13] and UCD [43]) and with a recent domain incremental method (MDIL [63]). We also include a simple baseline, activating only the task loss on the new classes and new domain (Eq. 6). This approach is usually referred to as fine-tuning, as the focus is posed solely on learning the new task. Two variants are reported for this baseline, with or without self-stylization applied on input images, indicated respectively as L^ñ_ce and L^n_ce. As for our approach, we evaluate its final form (Eq. 13), complete with all the training objectives detailed in Sec. 5, as well as a simpler configuration without the L^õ_kd loss (Eq. 12). By inspecting the results in Tables 4, 5 and 6, we notice that the performance achieved by the different methods at the end of the initial learning step is comparable. This is due to the similar objectives employed so far, which learn just the first class set (C_bgr) on the first domain, regardless of the domain order. We remark that the proposed self-stylization is not detrimental when learning the current task. We provide ablation studies on the impact of stylization in Sec. 8.
When progressing to the first incremental step, catastrophic forgetting has to be addressed to retain good performance. We observe that the L^n_ce and L^ñ_ce losses alone are not sufficient to achieve satisfactory results, being focused on the new task and providing no constraints to preserve past knowledge. MDIL [63] performs poorly as well, since its dynamic architecture is not suitable to address partial class incremental supervision, which in our setup is present along with domain incremental shift. Analyzing the class incremental learning methods, we note that they are able to preserve previously acquired knowledge to some extent, while allowing some plasticity for learning the new task. Still, the domain shift between previous and current datasets has a negative impact on the prediction accuracy of the incrementally trained predictor. All the considered CIL methods, in fact, rely on the ability of a segmentation model frozen from the previous step to preserve knowledge of the past. Yet, because of the domain discrepancy between past and new data, this distillation mechanism can introduce unreliable guidance on former tasks, as the frozen model is subject to a shift of the experienced input distribution when fed with new-domain data. At the same time, the distribution gap may hinder the transferability of new-class knowledge to old domains, which are no longer available as training data. These drawbacks are revealed by the results of Table 6 (IDD → CS → BDD): the significant domain shift between the Cityscapes and IDD datasets prevents CIL methods from effectively preserving and learning task-related clues on IDD, which was experienced at step 0.
On the contrary, our approach addresses domain shift by leveraging the stylization scheme and applying carefully designed objectives to suitably tackle the general class and domain incremental learning problem. In particular, the proposed objectives L^õ_ce (Eq. 7) and L^õ_kd (Eq. 12) are specifically designed to address the aforementioned problems affecting CIL methods, allowing superior accuracy on former domains. As a result, LwS improves accuracy by more than 17 mIoU points on IDD at step 1 w.r.t. the best competitor (i.e., UCD [43]). We also remark that, even with alternative domain orders (Tables 4 and 5), LwS shows the best stability-plasticity trade-off, retaining the best overall accuracy in terms of Δ̄_1. Furthermore, we can see that, for both the CS → BDD → IDD and BDD → IDD → CS orders, the addition of the L^õ_kd objective in LwS leads to a boost in performance on the past domains, which coincides with the design purpose of the objective.
In the final learning step, the struggle to handle class and domain incremental training is exacerbated for all the competitors. The baselines and MDIL still provide inferior results, with the latter performing even worse than naïve fine-tuning with self-stylization in some setups. Among CIL methods, PLOP [13] and UCD [43] are the best performing: both combine output- and feature-level objectives, which prove to be somewhat robust to domain shift. Even so, the simpler MiB [12] shows very competitive results, suggesting that strategies adopting only a class incremental perspective may not be so effective when incremental domain shift is also occurring. Our method in its complete form outperforms all CIL competitors by a large margin regardless of the domain order, with a Δ̄_2 gap going from 5% (BDD → IDD → CS) to 12% (CS → BDD → IDD) and even 16% (IDD → CS → BDD). Furthermore, in Table 7 we investigate the generalization performance (i.e., Γ^{gen}_t from Eq. 17) achieved by the considered methods. To do so, we compute the accuracy at each incremental step on the unseen Mapillary dataset for the sets of classes observed so far. We notice that simple fine-tuning and MDIL offer poor generalization results, which is expected given the low accuracy they already provide on directly observed datasets. On the other hand, CIL methods reach more competitive results, even if none of them proves superior in all setups. Still, our approach outperforms all competitors, getting significantly closer to the Oracle upper bound (i.e., supervised training on the entire pool of classes and domains). Finally, qualitative results in the form of segmentation maps are provided in Fig. 4.
We stress how the proposed approach yields better backward and forward transfer throughout the incremental learning. In particular, moving classes like bicycle and bus appear to be recognized more effectively by our method on the Cityscapes (CS) dataset at the end of the incremental training, even though CS was experienced only along with background-class supervision during the first step. On the other hand, MiB and PLOP fail to provide satisfactory backward transfer of those classes to the past CS domain. A similar reasoning applies to the forward transfer aptitude, with our approach delivering good segmentation accuracy on the road and other background classes over domains experienced in later steps. In Table 8 we report results on the same domain order, but with a different class order, i.e., moving classes C_mov experienced before static ones C_stat. We notice a similar trend to that observed in Table 4, with the baselines and MDIL [63] performing poorly, and the improved accuracy achieved by CIL methods still being largely outperformed by the proposed approach.
In addition, we observe that the absolute results are lower with the new class order: the Δ̄_2 score of our approach, in fact, worsens from 31.28% to 39.29%. This discrepancy might be due to class sets being observed on domains where it is harder to learn them and, at the same time, to generalize to the other domains. For instance, we note that IDD provides a lower overall percentage of C_stat pixels w.r.t. BDD (11% vs 17%), while for C_mov the numbers are similar (both around 10% of total pixels). Still, the performance loss is similar for CIL methods, with the gap w.r.t. the best competitor rising from 12 to 13 points of Δ̄_2 (compared to the previous class order).

Study on Model Architecture
We finally evaluate the considered methods when a more complex segmentation network is used, moving from the lightweight ErfNet to the heavier DeeplabV3 with ResNet101 backbone. For comparison purposes, the setup analyzed is again that involving the CS → BDD → IDD and C_bgr → C_stat → C_mov orders (Table 9). As for our approach, we observe an improved relative performance, going from 31.28% to 28.53% in terms of Δ̄_2. We emphasize that the Δ̄ measure already takes into account the better oracle results; the accuracy boost, then, shows that our method is able to capitalize on the increased capacity offered by the segmentation model. On the other hand, the CIL competitors are unable to take advantage of the growth in network capacity, which could indicate a tendency to overfit on the currently observed domain distribution. The best competitor (i.e., UCD), in fact, is significantly outperformed by more than 20% in terms of Δ̄ at both steps 1 and 2. We remark that no additional tuning of method-specific parameters is performed in this experimental setup.

Evaluation with Larger Geographic Diversity
The second class and domain incremental setup we explore is derived from the Mapillary dataset. Domain shift is once more induced by the variable geographic origin of image samples collected worldwide: we identify data partitions associated with 6 different continents, corresponding to 6 incremental steps. However, the Mapillary dataset contains varied data distributions even among intra-continent samples, providing a more robust support for training segmentation models. Data richness in turn promotes generalization across steps, lessening the domain gap between different domains. We report experimental results in Table 10. In the first steps, when the domain shift is small (e.g., between Europe, EU, and North America, NA), the different methods achieve similar performance. Nonetheless, when progressing to the last steps and experiencing an increased statistical gap (e.g., when introducing Africa's images, AF), our approach outperforms CIL competitors by a considerable margin, i.e., 5 points of Δ̄ w.r.t. the best competitor (PLOP) at the end of the incremental training. Also, superior performance in later steps is attained on both new and old domains, confirming the better plasticity-stability trade-off provided by our method. Overall, the improved results LwS reaches w.r.t. state-of-the-art CIL competitors, even when training data is collected to ensure some statistical diversity (as in the experimental setup just considered), further suggest that CIL methods are likely to be inadequate to deal with distribution shift in the input space.

Evaluation with Variable Environmental Conditions
We evaluate the proposed method when the incremental domain shift is due to changing environmental factors, i.e., the variable light conditions of the Shift dataset (daytime, twilight and night, as described in Sec. 6.2). The class incremental protocol follows [40], with the only difference being that the starting class pool to be split corresponds to the 22 Shift categories in place of the 19 Cityscapes ones. Results are reported in Table 11, where we compare with MiB and PLOP as CIL competitors, along with the fine-tuning baselines. We verify the superiority of our approach in jointly handling class and domain incremental training, as we surpass PLOP by 13 points of Δ̄_2. We once more point out the better stability-plasticity balance reached by our method, which achieves improved performance simultaneously over novel and former domains. Overall, results show that the proposed method is effective under domain shifts of different nature. On the other hand, CIL methods prove to be greatly penalized just by the variable scene illumination across tasks. We argue that in many real-world applications, such as autonomous driving, it is unrealistic to assume that a continual learner will not experience any sort of alteration of the input data distribution, making our continual learning approach much more applicable.

ABLATION STUDIES
In this section, we provide extensive ablation studies to investigate key features of our approach. We consider the urban experimental setup, with CS → BDD → IDD domain and C_bgr → C_stat → C_mov class orders, unless otherwise stated.

Contribution of Individual Optimization Objectives
We investigate the impact of each of the proposed learning objectives in the overall optimization framework in Table 12.
Just leveraging the currently available training data by fine-tuning (first two rows) yields unsatisfactory results (even with self-stylization), leading to catastrophic forgetting of class and domain knowledge. Yet, L^n_ce (or L^ñ_ce) is essential to learn new tasks, so it is kept in the following analyses of multi-term objectives.
Adding a second term to the overall objective (second block of rows) improves results, especially if the supplemental objective is focused on retaining old-class knowledge. In fact, we reach the best performance with a 2-term configuration when L^ñ_kd is introduced. This suggests that old-class knowledge preservation is effective even when applied on the new domain, which is directly experienced by means of the available training data. At the same time, the L^ñ_kd objective allows retaining good accuracy w.r.t. past domains, thanks to the improved generalization aptitude promoted by the stylization mechanism, without which (i.e., third row of the block) multiple accuracy points are lost.
When analyzing 3-term objectives (third block of rows), we see a noticeable gain with different combinations, except for L^ñ_kd and L^õ_kd jointly active, where the excessive focus on past-class knowledge preservation generates training instability. In the last row of the block, we clearly see that, by adding the L^õ_ce loss on top of the best two-term configuration, the incremental learning becomes more robust, with improved final results on all domains.
Finally, we remark that the full framework (last block) yields the best overall performance, with stylization once more playing a substantial role. The overall performance is, in fact, strongly degraded if stylization is turned off, as shown in the second-to-last row.

Pseudo-label Generation
We further analyze the influence exerted by pseudo-labeling in Table 13. We recall that the proposed enhanced labeling mechanism (described in Sec. 5.3) exploits oldly-stylized images to mitigate the domain shift endured by the frozen segmentation model distilling knowledge from the past.
We notice that when self-stylization is disabled (first two rows) the efficacy of our method is reduced, while the beneficial effect offered by the self-stylizing module can be appreciated in the last two rows. This occurs because self-stylization better prepares the segmentation model for future steps, in which the stylizing mechanism leverages old-domain styles to inject old-domain knowledge into the ongoing learning step. In other words, when self-stylizing images, what will be experienced as an old style has already been experienced as a new style before. Therefore, the undesired visual artifacts generated by style transfer are experienced by the network from the very first step in which each domain is introduced. This, in turn, ensures greater robustness over the incremental learning process. Furthermore, in setups with self-stylization, as opposed to what occurs without it, pseudo-labeling performed on top of oldly-stylized images yields the best overall performance, compared to the same labeling process executed over image samples with new-domain style. This happens because the network (frozen from the past step) used to generate pseudo-labels is better equipped to face the input distributions of previously experienced old domains, while it may suffer from domain shift when presented with new, unseen input distributions.
Fig. 5: Different ways of pseudo-labeling (t = 2). White regions correspond to the ignore label.
In Fig. 5 we report pseudo-labels generated according to different criteria, to provide visual confirmation of the improved pseudo-supervision achieved on top of the oldly stylization. The considered setup involves the CS → BDD → IDD and C_bgr → C_stat → C_mov progressions, and maps are retrieved at the last step (i.e., t = 2). We observe that the segmentation model taken from step t−1 (i.e., the second-to-last step) does not detect the sky region of the new-domain image, i.e., Ŷ^{2}_{t−1} provides unreliable supervision by labeling the top portion of the picture as unknown (although the true sky class is among those already seen). On the other hand, when leveraging oldly-stylized images to generate pseudo-supervision (Ŷ^{<t}_{t−1}), more reliable old-domain guidance (Ŷ^{0}_{t−1} and Ŷ^{1}_{t−1}) is exploited, with individual positive contributions successfully merged in the final map (e.g., in sky and road regions). Thus, we end up with Ŷ^{<t}_{t−1} being more accurate than each domain-specific alternative Ŷ^{k}_{t−1}, k ≤ t.
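The merging of the per-domain pseudo-label maps into a single map can be sketched as below. The keep-the-most-confident rule and the confidence threshold are our assumptions about how the individual contributions are combined, not details taken from the paper.

```python
import torch

def merge_pseudo_labels(labels, confs, ignore=255, thresh=0.5):
    """Merge per-domain pseudo-labels (one map per old-domain style) into
    a single map, keeping at each pixel the most confident prediction.
    labels: list of (H, W) long tensors; confs: list of (H, W) tensors."""
    conf_stack = torch.stack(confs)               # (K, H, W)
    lab_stack = torch.stack(labels)               # (K, H, W)
    best_conf, best_k = conf_stack.max(dim=0)     # winning style per pixel
    merged = lab_stack.gather(0, best_k[None])[0]
    merged[best_conf < thresh] = ignore           # drop unreliable pixels
    return merged
```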

Degree of Stylization
We propose an additional analysis of the stylization mechanism. Table 14 shows the results of our method (complete with all objectives) under different degrees of stylization, as determined by the β parameter (see Sec. 4). We notice that disabling stylization, or operating it in a more conservative manner (i.e., with β = 0.001), yields low results, with the latter configuration still outperforming the no-stylization approach; the statistical properties captured and transferred are not sufficient to successfully retain old-domain information. On the other hand, if the stylization is raised to an excessive extent (i.e., with β = 0.1), we observe performance degradation in the overall Δ̄_2 score. In this scenario, artifacts are more likely to be introduced on oldly-stylized images, thus hindering the segmentation task.

Knowledge Transfer Across Tasks and Domains
We propose further ablation studies to evaluate the knowledge transfer aptitude of our method, from both the task and the domain perspective. Fig. 6 presents a comparison of multiple CIL competitors in terms of predisposition towards domain-knowledge transfer; we report, in matrix form, the mIoU achieved on individual domains over multiple steps, restricted to the classes experienced so far. We consider multiple incremental setups, with urban datasets and variable domain order. We observe that our approach, right from the first learning step, achieves better forward transfer to future domains, as indicated by the per-domain mIoU values in the upper triangular sections, regardless of the setup considered. At the same time, this translates into superior performance on current domains (represented by the diagonal mIoU values), as they benefit from the better forward-adaptability acquired before. Moreover, improved backward transfer to former domains is testified by higher mIoU values in the lower triangular part of the matrices.
To provide insight into the task-knowledge transfer proneness of the different incremental methods, in Fig. 7 we report a comparison in terms of Δ̄ results at multiple learning steps; values are computed on single incremental sets of classes and represent an average score across all domains (both experienced and future ones). The experimental setups are the same considered when studying domain transfer, and results are arranged in matrix form. We observe that our Δ̄ scores in the lower triangular part of the matrices are lower than the competitors', suggesting that our method yields better backward transfer in terms of task knowledge. At the same time, the smaller diagonal Δ̄ elements indicate improved performance on current tasks, confirming the better stability-plasticity compromise offered by our approach.

CONCLUSIONS
In this paper, we formalized a general setting for continual learning, where both the domains and the tasks to be learned incrementally change over time. We addressed this under-explored learning setting targeting the semantic segmentation task, breaking it down into underlying sub-problems, each tackled with a specific learning objective. Leveraging a stylization mechanism, domain knowledge is replayed over time, whereas a robust distillation mechanism allows retaining and adapting old-task information. Overall, the proposed learning framework enables learning new tasks, while preserving performance on old ones and spreading task knowledge across all the encountered domains. We achieved significant results, outperforming state-of-the-art competitors on multiple challenging benchmarks. Further research will tackle even more application-oriented settings, e.g., where task and domain shifts happen in a continuous fashion rather than in discrete steps, and where distinct, overlapping sets of classes are introduced in different domains.

Fig. 1 :
Fig. 1: High-level view of our approach. Transparency decrease (top-down and left-to-right) indicates progression through learning steps. Colored task icons denote the presence of supervision within training data, grayscale ones signal its lack. At each step, we leverage training data to learn new classes on the new domain. Domain stylization allows reiterating the old-domain distribution, which is crucial to learn new tasks and preserve old ones on former domains, and to adapt old-domain old-task knowledge to new domains.

Fig. 2 :
Fig. 2: Overview of the class and domain incremental setup. At each step, training data comes from a new domain and is labeled on a new class set. At test time, performance is measured on all domains and classes experienced so far.

Fig. 3:
Fig. 3: Model architecture: we decompose class and domain IL into simpler sub-problems, each addressed by a suitable objective (4 panels on the right side); to access no longer available old-domain data, we resort to stylization (left side).

Fig. 4 :
Fig. 4: Qualitative results on the CS → BDD → IDD domain setup and C_bgr → C_stat → C_mov class setup.

Marco Toldo received the M.Sc. degree in ICT for Internet and Multimedia in 2019 from the University of Padova. At present, he is pursuing his Ph.D. at the Department of Information Engineering of the same university. In 2021, he was an intern Research Engineer at Samsung Research UK. His research involves domain adaptation and continual learning applied to computer vision.
Umberto Michieli received his Ph.D. in Information Engineering from the University of Padova in 2021. Currently, he is a Postdoctoral Researcher and Adjunct Professor at the same university. He spent research periods at Technische Universität Dresden and Samsung Research UK. His research lies at the intersection of foundational AI problems applied to semantic understanding. In particular, he focuses on domain adaptation, continual learning, coarse-to-fine learning and federated learning.
Pietro Zanuttigh received the Ph.D. degree from the University of Padova, Italy, in 2007. He is currently an Associate Professor at the Department of Information Engineering. His research interests are image and 3D data processing and analysis, with a special focus on domain adaptation and incremental learning in semantic segmentation, ToF sensor data processing, and hand gesture recognition.

TABLE 1 :
Training objectives: the n/o superscripts denote the use of new/old domain data, with a tilde (˜) implying stylization.

TABLE 3 :
Class and domain incremental sets. C^s indicates that the class subset is derived from Shift's original set.

TABLE 4 :
Experimental results on the CS → BDD → IDD domain setup and C_bgr → C_stat → C_mov class setup.

TABLE 5 :
Experimental results on the BDD → IDD → CS domain setup and C_bgr → C_stat → C_mov class setup.

TABLE 6 :
Experimental results on the IDD → CS → BDD domain setup and C_bgr → C_stat → C_mov class setup.

TABLE 8 :
Experimental results on the CS → BDD → IDD domain setup and C_bgr → C_mov → C_stat class setup.

TABLE 9 :
Experimental results with DeeplabV3-ResNet101 on the CS → BDD → IDD domain setup and C_bgr → C_stat → C_mov class setup.

TABLE 10 :
Experimental results on the Mapillary dataset (EU → NA → AS → OC → AF → SA).

TABLE 11 :
Experimental results on the Shift dataset.

TABLE 12 :
Ablation study on the contribution of the loss components. The L^n_kd notation here implies that pseudo-labels are generated leveraging new-domain input samples.