Temporal-Domain Adaptation for Satellite Image Time-Series Land-Cover Mapping With Adversarial Learning and Spatially Aware Self-Training

Nowadays, satellite image time series (SITS) are commonly employed to derive land-cover maps (LCM) to support decision makers in a variety of land management applications. In the most general workflow, the production of LCM strongly relies on available ground truth (GT) data to train supervised machine learning models. Unfortunately, these data are not always available due to time-consuming and costly field campaigns. In this scenario, the possibility to transfer a model learnt on a particular year (source domain) to a successive period of time (target domain), over the same study area, can save time and money. Such a model transfer is challenging due to the different acquisition conditions affecting each time period, which result in possible distribution shifts between source and target domains. In the general field of machine learning, unsupervised domain adaptation (UDA) approaches are well suited to cope with the learning of models under distribution shifts between source and target domains. While widely explored in the general computer vision field, they are still underinvestigated for SITS-based land-cover mapping, especially for the temporal transfer scenario. With the aim to cope with this scenario in the context of SITS-based land-cover mapping, here we propose the spatially aligned domain-adversarial neural network (SpADANN), a framework that combines both adversarial learning and self-training to transfer a classification model from a time period (year) to a successive one on a specific study area. Experimental assessment on a study area located in Burkina Faso, characterized by challenging operational constraints, demonstrates the significance of our proposal. The obtained results show that our proposal outperforms all the competing UDA methods by 7 to 12 points of F1-score across three different transfer tasks.


I. INTRODUCTION
TODAY, satellite imagery represents a fundamental source of information to monitor the dynamics of the Earth surface, providing valuable knowledge to support decision makers in several application domains [1]. Recent space programmes (e.g., the European Union's Copernicus programme and its Sentinel missions) provide open-access satellite imagery with both high spatial resolution and high revisit frequency. They capture satellite image time series (SITS) data that can be leveraged to monitor phenomena in a variety of domains, such as ecology [2], agriculture [3], forestry [4], and natural habitat monitoring [5].
SITS data, unlike monodate imagery, contain information about the evolution of the Earth surface, making it possible, for instance, to distinguish vegetated land covers that evolve differently over a yearly cycle of seasons (e.g., in agriculture, different cropping practices exhibit different dynamics in their radiometric signal over a growing season).
Among the possible uses of SITS data, the production of land-cover maps (LCMs) over a specific region [6] is of paramount importance. The increasing availability of SITS data, along with advances in machine learning [7] and, more precisely, deep learning [8], has led to land-cover mapping systems that largely benefit from the information carried by time series of remote sensing imagery.
Nonetheless, supervised machine learning methods require large amounts of reference [or ground truth (GT)] data to be trained, which poses serious challenges to their use when reference data are scarce or unavailable. For instance, when LCMs have to be updated from previous years, costs or restrictions related to new field campaigns can make it impossible to collect new reference data, thus hindering the learning of an up-to-date classification model [9].
An ideal solution would be to reuse data already available on a study site, for instance collected in previous field campaigns or shared over the past years by some public/government agency, to save time and money for the production or update of LCMs. This option can, on one hand, take advantage of the efforts previously made and, on the other hand, limit the need for fresh reference data on a study area whose accessibility may be reduced or compromised. However, in the specific, yet common, case in which significant land-cover changes occur over a certain reference period, e.g., for agricultural landscapes with year-to-year crop type changes among fields, simply training a new model using up-to-date images and legacy reference data is not a solution, and the use of transfer learning strategies becomes necessary.
We start here from the observation that directly transferring a model trained on a particular year (the source domain) to a successive period of time (the target domain) can be challenging, since the two time periods can be affected by different environmental, weather, or climate conditions [10], [11]. This results in differences or shifts in the distributions of the acquired yearly remote sensing data.
Addressing the distribution shift problem to adapt a model trained on a source domain to an unlabeled target domain is known as unsupervised domain adaptation (UDA) [12] in the general field of machine learning. The UDA approach has the objective to provide methods and strategies to cope with distribution shifts between the data on which the model is trained (source domain) and the data on which the model is deployed (target domain) [13].
Here, we consider the temporal UDA (tUDA) problem, where data are SITS and the task is to provide yearly land-cover mapping. The goal is to train a classification model capable of providing a reliable LCM using an image time series of a given year for which no GT is provided (target domain), together with both SITS and sparsely annotated GT data from a previous year (source domain).
When dealing with real-world land-cover mapping, the collected GT is generally sparse due to the operational constraints related to the time and effort associated with field campaigns [14], [15]. This means that a limited number of polygons (in terms of surface with respect to the study site) is annotated by field experts with the aim of having samples covering the whole study area. As a matter of fact, the common operational GT data collection protocol prevents the use of standard semantic segmentation approaches, since the latter require densely annotated GT data, as underlined in [16] and [17], forcing the conceived land-cover mapping solution to work at the pixel [18] or at the parcel [19] granularity.
To cope with the challenging tUDA setting affecting SITS-based land-cover mapping, in this article, we propose the spatially aligned domain-adversarial neural network (SpADANN), a framework that combines adversarial learning and self-training for tUDA in SITS-based land-cover mapping under sparsely annotated GT data. More precisely, SpADANN leverages adversarial learning with the aim of extracting domain-invariant features, and it progressively transfers the underlying classification model from the source to the target domain via self-training. To leverage the peculiarity of remote sensing data, the self-training process generates pseudolabels on the target domain by identifying stable spatial areas between the two considered years (domains) and uses such spatial areas (anchor points) to further alleviate the distribution shift between domains. In addition, to explicitly cope with the temporal dimension characterizing SITS data, we leverage 1-D convolutional neural networks (NNs) as the backbone of our framework. Extensive experimental evaluations are carried out to assess the behavior of SpADANN against state-of-the-art UDA approaches, considering both quantitative and qualitative aspects, on a rural study site located in Burkina Faso, referred to as the Koumbia site and characterized by a mostly agricultural land-cover nomenclature (crop types as well as natural and built-up classes). The associated GT data are highly sparse due to operational constraints related to labor-intensive and costly field campaigns spanning the years 2018, 2020, and 2021.
The rest of this article is organized as follows. Section II presents the related literature in SITS-based land-cover mapping, self-training, and domain adaptation. Section III describes the tUDA problem setting and introduces the proposed SpADANN framework to cope with tUDA for SITS-based land-cover mapping. Section IV presents the study site and the associated data, while the experimental evaluation is reported in Section V. Section VI discusses the obtained results and short-term follow-ups. Finally, Section VII concludes this article.

II. RELATED WORK
A. SITS-Based Land-Cover Mapping Under Sparsely Annotated GT Data
Land-cover mapping from SITS data is of paramount importance to monitor and characterize spatiotemporal phenomena occurring on the Earth surface, e.g., to quantify natural resources [20], estimate agricultural surfaces [21], or assess human settlement evolution [7]. Inglada et al. [7] propose an operational framework to perform large-scale land-cover mapping at national scale from time-series data. The classification is achieved via the random forest (RF) classifier that, still today, represents a well-established approach for SITS-based land-cover mapping. Ienco et al. [22], Rußwurm and Körner [23], and Minh et al. [24] deal with land use and land-cover (LULC) mapping via recurrent NN approaches. In both [22] and [23], SITS data are managed via long short-term memory, while Minh et al. [24] still consider recurrent NN strategies but inspect the performance of the gated recurrent unit for classification. Pelletier et al. [3] propose the use of 1-D (temporal) convolutional NNs for SITS-based land-cover mapping, referred to as TempCNN. In this model, the convolutional operator is applied on the temporal dimension of the SITS data with the purpose of modeling short- and long-term correlations. The conducted study highlights the appropriateness of such an approach w.r.t. the previously proposed strategies in the context of general LULC mapping from SITS data. Furthermore, Zhong et al. [25] provide a comparison of recurrent and convolutional NNs for the classification of summer crops, highlighting that the latter approach achieves the best performance in their case study. More recently, Garnot et al. [26] propose the pixel-set encoder temporal attention encoder, a transformer-based strategy equipped with a pixel-set encoder and a self-attention module for agricultural parcel classification.
Despite the recent progress in the field of SITS-based land-cover mapping, the proposed algorithms still struggle to manage data coming from different temporal periods, thus limiting their applicability in a temporal transfer scenario.

B. Self-Training Methods
Self-training [27] can be seen as a particular case of semisupervised learning [28], where a machine learning model is trained using a reference dataset composed of a small set of labeled samples and a large amount of unlabeled ones. More precisely, in the self-training setting, a model is trained iteratively by assigning pseudolabels to the set of unlabeled training samples and, successively, enriching the current labeled training set with the pseudolabeled samples on which the model exhibits high confidence. Cotraining [29] is one of the earliest and most popular techniques proposed in the context of self-training. In cotraining, examples are defined by two views that are decorrelated from each other. The goal of learning is to train a classifier on each view by first initializing it with the available labeled training data. Then, one of the classifiers assigns pseudolabels to unlabeled data, which the other one uses to learn. Following training, the classifiers switch roles, with the newly trained classifier assigning pseudolabels to unlabeled examples, which are then used to train the first classifier. This procedure continues until there are no more unlabeled instances to be pseudolabeled. Tritraining [30] is a direct extension of the cotraining approach in which three classifiers are generated from the original labeled set. These classifiers are then refined using unlabeled examples in a tritraining process. In detail, in each round of tritraining, an unlabeled example is labeled for a classifier if the other two classifiers agree on its pseudolabel. Another popular self-training technique is mean teacher [31]. This method employs two NNs as supervised classifiers: one of the models is named the teacher, while the other is called the student. These two models are structurally identical, and their weights are related in that the teacher's weights are an exponential moving average of the student's weights. In this scenario, the student model is the only one that is trained over the labeled training set, and a consistency loss is computed between the teacher's predicted probability distribution and the student's one.
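The exponential moving average that links teacher and student weights in the mean teacher scheme can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `ema_update`, the flat list of weights, and the decay value are illustrative assumptions and are not taken from [31].

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights updated as an exponential moving average (EMA)
    of the student weights. `decay` is an assumed, illustrative value."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy weights: the teacher slowly drifts toward the student after each step.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, decay=0.9)
```

With a decay close to 1, the teacher changes slowly and therefore provides smoother targets for the consistency loss than the student itself.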
Banerjee et al. [32] proposed a clustering-based approach to perform LULC mapping from multispectral satellite image data in an unsupervised fashion. In more detail, pixels are first clustered together, then a cluster labeling procedure is employed to obtain initial labeled samples, and finally the machine learning classifier is iteratively trained via self-training. More recently, Wu et al. [33] propose to cope with hyperspectral image classification through the lens of self-training. The authors exploit self-training to alleviate issues related to the tedious and time-consuming process of data annotation. Rather than using only the classifier confidence to select pseudolabels, the proposed approach also leverages spatial consistency (in terms of spatial neighborhood) to correct possible mistakes in the training enrichment step. Paris et al. [34] introduce a framework that combines self-training (referred to as self-paced learning) and active learning in order to iteratively enrich an initially small labeled training set with informative samples for LULC classification of SITS data via support vector machines on the Google Earth Engine platform.
Self-training methodologies are receiving more and more attention due to their ability to train machine learning models in a data paucity scenario. While many frameworks have already been proposed for image or scene classification [35], only a few research studies have leveraged self-training for time-series analysis [36] or SITS-based land-cover mapping [34]. Furthermore, all the research studies associated with time-series analysis work in a classical context where no domain shift exists between the training and target data.

C. Unsupervised Domain Adaptation
UDA [12] methods belong to the family of transfer learning approaches [37], which have the main objective of transferring a model trained on a labeled source domain to an unlabeled target domain. Recent advances in UDA focus on extracting domain-invariant features by either aligning domains through data transformation or performing adversarial training with the aim of reducing the distribution gap between the source and target domains [12]. Regarding the first category of methods, the one that aligns domains through data transformation, Zhuang et al. [38] proposed a geodesic flow kernel (GFK) based strategy to align source and target data distributions. The method projects both source and target data into a shared, low-dimensional space in which the distribution shift between the two domains should be reduced. Since GFK only provides the low-dimensional data projection, a standard supervised model needs to be subsequently trained to perform the final classification on the target data. Concerning domain-invariant approaches based on adversarial training, Tzeng et al. [39] define the adversarial discriminative domain adaptation (ADDA) method. Inspired by the concept of generative adversarial networks, this approach sets up a two-player learning game where a discriminator network tries to distinguish between source and target sample representations derived by the generator, while the generator tries to fool the discriminator network. Currently, adversarial learning is one of the main trends when it comes to UDA.
Still based on the adversarial training principle, Ganin et al. [40] introduce the domain-adversarial neural network (DANN) model, where a standard NN model is augmented with a domain classifier that tries to distinguish between source and target samples in a multitask learning setting. The domain classifier is associated with a gradient reversal layer (GRL) that enforces the features extracted by the encoder to be invariant w.r.t. the domains. The CDAN+E approach [41] extends the DANN framework by conditioning the discriminator on the prediction of the classification network for source and target data, and it introduces an entropy regularization to prioritize the transfer of easy-to-transfer samples. This should, in theory, focus the source-target matching on instances belonging to the same class. The GRL principle introduced in the DANN framework is also the core of more recent UDA approaches, such as the margin disparity discrepancy (MDD) [42] and the adversarial-learned loss for domain adaptation (ALDA) [43] frameworks.
Concerning the remote sensing field, early research focused on proposing UDA strategies for high spatial resolution images [44], while only recently some strategies are emerging in the context of SITS [11]. More generally, in this context, distribution shifts between training (source) and test (target) data can be induced by different factors, and among others, differences in sensor acquisitions and environmental conditions are the most recurrent ones. However, such differences can be related to either the geographical shift from a study site to another one [45] or the temporal delay among acquisitions for data covering the same area in two different periods [10].
Regarding differences in sensor acquisitions, Wang et al. [46] propose a cross-sensor UDA framework to cope with spatial and spectral distribution shifts between airborne and spaceborne very high spatial resolution (VHR) imagery, with a specific focus on urban LCM. The domain adaptation process leverages a self-training approach to transfer a classification model from the source to the target domain. Concerning differences in environmental conditions, Chen et al. [45] propose an adversarial-based strategy to adapt a semantic segmentation model so that it can be transferred from one spatial location to a different one. The method is conceived, also in this case, to cope with monodate VHR imagery, mainly considering urban land-cover mapping.
Related to SITS-based land-cover mapping, very few domain adaptation frameworks exist. Focusing on spatial transfer, Wang et al. [47] propose a framework based on recurrent NNs and the maximum mean discrepancy (MMD) principle in order to embed both source and target SITS pixels in a common shared space. This is achieved by using an encoder per source and the MMD strategy to align the two domains. In [48], the combination of a transformer encoder-based classifier and the DANN strategy with GRL is evaluated to cope with spatial transfer learning in the context of land-cover mapping from SITS data. More recently, Nyborg et al. [11] propose a framework to cope with agricultural parcel mapping with the objective of achieving spatial transferability. The approach combines a module that aligns time-series information, based on the estimation of the time shift between SITS coming from the source and the target domain, with a self-training strategy that adapts the model to samples coming from the target domain. For the case of temporal transferability, Tardy et al. [10] perform preliminary investigations via optimal transport baselines for the case of tUDA from multiple source domains (multiple annual SITS) to a specific target domain (annual SITS). The obtained findings reveal that the use of optimal transport baselines results in a lower accuracy than the direct transfer of a supervised classifier from the source to the target domain, thus underlining that the problem of temporal transfer is quite complex and that more advanced methods are needed.
The extensive literature review we have performed clearly underlines that recent UDA approaches, especially the ones based on deep learning strategies, are still unexplored and underexploited in the context of UDA for SITS analysis. More in detail, a major lack is related to frameworks and methodologies addressing the important challenge related to tUDA on which we set the focus of this work.

III. SPADANN
In this section, we introduce our proposed framework, SpADANN with self-training, to deal with tUDA for SITS-based land-cover mapping. We first provide the problem setting, then we give an overview of SpADANN. Successively, we supply the details of the different components on which SpADANN is built.

A. Problem Setting
In this work, we consider the problem of tUDA. We are given a labeled source domain $D_s$ and an unlabeled target domain $D_t$, with $n_s$ and $n_t$ the number of samples for the source and target domain, respectively. We indicate with $X^s$, $Y^s$, and $X^t$ the set of source samples, source labels, and target samples, respectively. Each sample $x_i \in \mathbb{R}^{T \times B}$ is a SITS pixel defined over $T$ timestamps and characterized by $B$ spectral bands. The land-cover information ($y_i^s$) is only available for the source domain, and $y_i^s \in \{1, \ldots, K\}$ can take one value between 1 and $K$, with $K$ the number of land-cover classes on which the multiclass classification problem is defined. The sets of SITS pixels belonging to the two domains cover exactly the same spatial area but at different periods of time (i.e., different years). This means that the same spatial location is covered by a SITS pixel coming from the source domain as well as one coming from the target domain. Thus, $n_s$ is equal to $n_t$ and $\mathrm{location}(x_i^s)$ is equal to $\mathrm{location}(x_i^t)$, where $\mathrm{location}(\cdot)$ is a function providing the spatial location of a SITS pixel in terms of geographical coordinates. Due to differences in environmental, weather, or climate acquisition conditions between the SITS pixels belonging to the source ($D_s$) and the target ($D_t$) domain, distribution shifts can affect the two sets of data, thus impacting the performance of standard inductive supervised classification approaches [10], [11], [12].
Here, the goal is to train a SITS-based land-cover mapping model that is robust to data distribution shifts and that exploits both source ($D_s$) and target ($D_t$) domain information with the aim of predicting, for a given pixel $x_i^t$ belonging to the target domain ($D_t$), the corresponding land-cover class $y_i^t$. We recall that the set of land-cover classes of the target domain is exactly the same as the set of land-cover classes of the source domain, as in a general closed-set scenario [49].

B. SpADANN Overview
Hereafter, we provide a general overview of our framework, with the aim to supply a picture of how SpADANN behaves as well as describe the general principles behind it. Fig. 1 visually sketches the SpADANN framework.
SpADANN combines both adversarial learning and self-training with the aim of learning a representation space (features) that is invariant with respect to possible distribution shifts between source and target domains (in our case, SITS pixels coming from two time periods, i.e., years, covering exactly the same geographical area) and of progressively transferring the underlying classification model from the source to the target domain. While the adversarial learning strategy is based on the model proposed in [40], which we adapt for the special case of time-series data, the pseudolabeling procedure deeply exploits the features characterizing the tUDA problem. The pseudolabel selection is based on the fact that the source and the target domains are spatially aligned (i.e., they cover exactly the same geographical area). More precisely, given two spatially aligned SITS pixels ($\mathrm{location}(x_i^s) = \mathrm{location}(x_i^t)$), where $x_i^s$ comes from the source domain and $x_i^t$ comes from the target domain, if the land-cover classifier $L$ provides the same decision for both pixels and the predicted class for the source SITS pixel $x_i^s$ is the correct one, then the target SITS pixel $x_i^t$ is associated with the pseudolabel generated by the land-cover classifier. Finally, as the iterative training procedure goes on, the pseudolabel information gains more importance, with the aim of progressively transferring the underlying classification model from the source to the target data.

C. Adversarial Learning
With the aim to extract SITS pixel representations that are invariant to the particular domain they come from (source or target), we adapt the strategy proposed in [40], namely DANN as backbone block in the SpADANN framework. Fig. 2 depicts the architecture of the proposed backbone network.
The network architecture has three components: an encoder network $F$ relying on the parameters $\Theta_F$, a land-cover classifier network $L$ with parameters $\Theta_L$, and a domain classifier network $D$ with parameters $\Theta_D$. Since we are dealing with SITS pixels, we adopt an encoder model especially tailored for this kind of data, namely the TempCNN model [3], due to its confirmed ability to cope with the task of SITS-based land-cover mapping in a standard in-domain setting through 1-D convolution on the time dimension.
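To make the idea of a 1-D convolution on the time dimension concrete, the following NumPy sketch applies a bank of temporal filters to a single SITS pixel of shape (T, B). It is only an illustration under stated assumptions: the function name `temporal_conv1d`, the number of filters, the kernel size, and the ReLU activation are hypothetical choices, and the sketch is not the full TempCNN architecture of [3].

```python
import numpy as np


def temporal_conv1d(pixel, kernels, bias):
    """Valid 1-D convolution (cross-correlation, as is usual in deep learning)
    along the time axis, followed by a ReLU.

    pixel:   (T, B) time series of one pixel, T timestamps and B bands
    kernels: (F, K, B) F filters spanning K timestamps and all B bands
    bias:    (F,)
    returns: (T - K + 1, F) feature map
    """
    T, _ = pixel.shape
    F, K, _ = kernels.shape
    out = np.empty((T - K + 1, F))
    for t in range(T - K + 1):
        window = pixel[t:t + K]  # (K, B) slice of consecutive timestamps
        # Contract the (K, B) axes of every filter with the window.
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)


# Illustrative sizes (assumptions): T=10 dates, B=4 bands, F=8 filters, K=3.
rng = np.random.default_rng(0)
pixel = rng.normal(size=(10, 4))
kernels = rng.normal(size=(8, 3, 4))
features = temporal_conv1d(pixel, kernels, np.zeros(8))
```

Each filter slides over consecutive acquisition dates and mixes all spectral bands, which is how a temporal encoder of this kind can pick up short-term patterns in the radiometric signal.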
The SpADANN backbone is a multioutput network that has the objective of generating a new data representation via the encoder $F$ that ensures high land-cover classification accuracy and, simultaneously, makes it difficult to distinguish the domain each SITS pixel comes from.
Fig. 2. Internal classification architecture of SpADANN. It is based on the DANN adversarial learning strategy [40] coupled with the TempCNN [3] encoder to customize the architecture for the special case of SITS data. The model has three components: an encoder network $F$ (the TempCNN model), a land-cover classifier network $L$, and a domain classifier $D$. The multitask network has the objective of generating a new data representation via the encoder $F$ that has high land-cover classification accuracy (maximizing the performance of $L$) and, simultaneously, makes it difficult to distinguish the domains the SITS pixels come from (confusing the $D$ component).

The DANN loss function is defined as follows:

$$\mathcal{L}_{DANN} = \mathcal{L}_c(X^s, Y^s \mid \Theta_F, \Theta_L) + \lambda \, \mathcal{L}_{Adv}(X^s, X^t \mid \Theta_F, \Theta_D)$$

where $\mathcal{L}_c(X^s, Y^s \mid \Theta_F, \Theta_L)$ is the loss associated with the land-cover classification problem, modeled with the standard categorical cross-entropy function [40], and $\mathcal{L}_{Adv}(X^s, X^t \mid \Theta_F, \Theta_D)$ is the loss related to the domain classifier, modeling a binary classification problem in which the class label indicates whether a sample belongs to the source or to the target domain. Also in this case, the categorical cross-entropy function is employed. Finally, the hyperparameter $\lambda$ controls the influence of the domain classifier loss on the learnt features. In order to leverage standard stochastic gradient descent to optimize the $\mathcal{L}_{DANN}$ loss function, the $\mathcal{L}_c(\cdot \mid \cdot)$ loss is optimized as commonly done for general NN models, while for the $\mathcal{L}_{Adv}(\cdot \mid \cdot)$ loss, we employ the GRL trick [40]. In more detail, the GRL acts as the identity transform during the forward propagation pass, while it reverses the gradient (the gradient is multiplied by $-1$) during the backward propagation pass, when the gradient is exploited for the update of the weights of the encoder $F$. In this way, the GRL trick allows the adversarial training strategy to be implemented with a standard backpropagation of the gradients, without adding any extra parameters to the model. More precisely, the domain classifier parameters are updated in a standard way, with the aim of helping the model to distinguish between source and target samples, while the reversed gradient applied to the encoder network forces the model to generate domain-invariant features with the goal of fooling the domain classifier [48].
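The behavior of the GRL described above (identity in the forward pass, sign flip in the backward pass) can be sketched without any deep learning framework. The class `GradientReversal` and the helper `encoder_gradient` below are hypothetical names introduced for illustration; the sketch assumes that the gradients of the classification and domain losses with respect to the encoder output have already been computed elsewhere.

```python
import numpy as np


class GradientReversal:
    """Identity in the forward pass, multiplies the gradient by -1 in the
    backward pass, and has no trainable parameters."""

    def forward(self, features):
        return features

    def backward(self, grad_output):
        return -grad_output


def encoder_gradient(grad_cls, grad_adv, lam):
    """Gradient reaching the encoder output (illustrative helper).

    grad_cls: gradient of the classification loss w.r.t. the features
    grad_adv: gradient of the domain loss w.r.t. the features, as seen by
              the domain classifier before reversal
    lam:      weight of the domain loss (lambda in the text)
    """
    grl = GradientReversal()
    return grad_cls + lam * grl.backward(grad_adv)


grl = GradientReversal()
feats = np.array([1.0, -2.0])
same_feats = grl.forward(feats)  # forward pass leaves the features untouched
g = encoder_gradient(np.array([1.0, 2.0]), np.array([0.5, -1.0]), lam=0.1)
```

The domain classifier itself would still be updated with the non-reversed gradient, so a single backward pass pushes the domain classifier and the encoder in opposite directions.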

D. Spatial Consistent Pseudolabeling
In order to further adapt the SITS classification model presented in Section III-C to effectively classify pixels coming from the target domain, we leverage a self-training strategy that associates pseudolabels with a subset of the data coming from $D_t$. This is done to inject pseudo-supervision on the target domain into the training process, allowing the land-cover classifier subnetwork $L$ to tackle the classification of SITS pixels coming from $D_t$.
In a standard self-training pipeline [50], given a set of unlabeled samples, the model output distribution is employed to select a subset on which the model has high confidence. Successively, such pseudolabeled samples are used to enrich the current training set as the training process proceeds. More precisely, this mechanism is implemented by defining a threshold on the model softmax output and, subsequently, choosing all the samples for which the probability of the most probable class is greater than the defined threshold. This widely adopted process suffers from the fact that a threshold needs to be defined, and the way this hyperparameter is set can drastically affect the performance of the underlying sampling process [51].
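The threshold-based selection used in a standard self-training pipeline can be sketched as follows. The function name `select_confident` and the threshold value 0.9 are illustrative assumptions, and class indices are 0-based in the code.

```python
import numpy as np


def select_confident(probs, threshold=0.9):
    """Return the indices and pseudolabels of the samples whose highest
    softmax probability exceeds `threshold` (an assumed, tunable value)."""
    probs = np.asarray(probs, dtype=float)
    pseudo = probs.argmax(axis=1)          # most probable class per sample
    keep = probs.max(axis=1) > threshold   # confidence test
    return np.flatnonzero(keep), pseudo[keep]


# Softmax outputs for three unlabeled samples and three classes.
probs = [[0.95, 0.03, 0.02],
         [0.40, 0.35, 0.25],
         [0.05, 0.92, 0.03]]
idx, labels = select_confident(probs, threshold=0.9)
```

Here the second sample is discarded because its most probable class only reaches 0.40. Moving the threshold changes which samples are kept, which illustrates the sensitivity to this hyperparameter mentioned above.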
In our case, we leverage the specificity of the land-cover mapping tUDA problem by conceiving a process based on the spatial consistency between the two SITS pixels $x_i^s$ and $x_i^t$ sharing the same spatial location ($\mathrm{location}(x_i^t) = \mathrm{location}(x_i^s)$). Such a strategy provides a solution to the pseudolabel selection process that avoids the definition of any kind of threshold, thus reducing the hyperparameter tuning associated with our framework. More precisely, the set of target pixels to which pseudolabels will be associated is chosen based on two criteria that need to be met simultaneously. The first criterion is based on spatial consistency, as described below:

$$\hat{y}_i^t = \hat{y}_i^s \quad \text{with } \mathrm{location}(x_i^t) = \mathrm{location}(x_i^s)$$

and the second criterion requires that the land-cover classifier $L$ supplies the correct prediction for the source sample:

$$\hat{y}_i^s = y_i^s$$

where $\hat{y}_i^s$ and $\hat{y}_i^t$ denote the classes predicted by $L$ for the source pixel $x_i^s$ and the target pixel $x_i^t$, respectively. The idea behind this selection process is to choose target samples that remain stable, in terms of model output prediction, w.r.t. the corresponding source pixel in terms of spatial location, while simultaneously enforcing that the model predicts the correct land-cover class on the source sample $x_i^s$. In this way, the procedure selects pseudolabeled samples that act as anchor points between the source and the target domains, exploiting the model output stability and, at the same time, leveraging target samples that are in principle characterized by a small distribution gap and are thus more effective in supporting the classification model transfer from the source to the target domain.
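More formally, we can define the loss associated with the pseudolabeled samples as follows:

$$\mathcal{L}_p(X^t, \hat{Y}^t \mid \Theta_F, \Theta_L) = \sum_{i=1}^{n_t} \mathbb{1}_{cond} \, H\big(\hat{y}_i^t, Cl_{prob}(x_i^t)\big)$$

where $\mathbb{1}_{cond}$ is an indicator function that returns 1 if the condition $cond$ (here, the two selection criteria of this section holding simultaneously for the $i$th pair of spatially aligned pixels) is verified and 0 otherwise, $Cl_{prob}(\cdot)$ provides the model output distribution over the possible land-cover classes, $H(\cdot, \cdot)$ is the classical categorical cross-entropy function, $\hat{Y}^t$ is the whole set of possible pseudolabels for the target domain, and $\hat{y}_i^t$ is the pseudolabel, i.e., the land-cover class with the highest model output probability w.r.t. $Cl_{prob}(x_i^t)$, for the pixel $x_i^t$ coming from the target domain.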
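The two selection criteria and the indicator-weighted cross-entropy described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function name `spatial_pseudolabel_loss` is hypothetical, row $i$ of the source and target arrays is assumed to refer to the same spatial location, class indices are 0-based, and the loss is accumulated as a plain sum (any normalization is left out).

```python
import numpy as np


def spatial_pseudolabel_loss(probs_s, probs_t, y_s, eps=1e-12):
    """Select spatially consistent pseudolabels and compute their loss.

    probs_s, probs_t: (n, K) classifier output distributions for the source
                      and target pixels (row i = same location, assumption)
    y_s:              (n,) ground-truth source labels, 0-based
    returns: (mask, pseudolabels, loss)
    """
    probs_s = np.asarray(probs_s, dtype=float)
    probs_t = np.asarray(probs_t, dtype=float)
    y_s = np.asarray(y_s)
    pred_s = probs_s.argmax(axis=1)
    pred_t = probs_t.argmax(axis=1)  # candidate pseudolabels
    # Criterion 1: same decision on both dates.
    # Criterion 2: the source prediction is correct.
    mask = (pred_t == pred_s) & (pred_s == y_s)
    # Cross-entropy between the one-hot pseudolabel and the target output.
    ce = -np.log(probs_t[np.arange(len(pred_t)), pred_t] + eps)
    return mask, pred_t, float((mask * ce).sum())


probs_s = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
probs_t = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]
y_s = [0, 1, 1]
mask, pseudo, loss = spatial_pseudolabel_loss(probs_s, probs_t, y_s)
```

In this toy case only the first location is retained: the second one changes class between the two dates, and the third one is misclassified on the source side. No confidence threshold is involved in the selection.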

E. SpADANN Training Procedure
In this section, we introduce the general training procedure used to optimize the parameters of the SpADANN framework. Algorithm 1 briefly summarizes the pseudocode of the training procedure. The inputs of the procedure are the data coming from the source ($D_s = (X^s, Y^s)$) and the target ($D_t = (X^t)$) domains, the hyperparameter $\beta$ associated with the progressive transfer strategy, and $N_e$, the number of epochs of the learning procedure.
Algorithm 1 (inputs and outputs)
Require: X s (the source SITS pixels), Y s (the source labels), X t (the target SITS pixels), β (the progressive transfer hyperparameter), N e (the number of epochs).
Ensure: Θ F (parameters of the encoder), Θ L (parameters of the land-cover classifier).

First, the current classification model is applied to the target data X t in order to obtain the set of possible pseudolabels Ŷ t (line 3). Then, we compute the tradeoff value α as a linear function of the current epoch (e), the total number of epochs (N e ), and the input hyperparameter β (line 4). The α value is subsequently employed to weight the contribution of the L DANN and the L p losses in a convex combination of the two terms, with the aim of varying their importance during the learning procedure (line 5). In more detail, at the beginning of the procedure, the α value starts from zero and linearly increases in order to progressively give more importance to the L p term. This is done because, at the early iterations of the procedure, we want the model to exploit as much information as possible from the labeled source domain while learning SITS pixel representations that are invariant w.r.t. the specific domain. The reason is that, at the first iterations, the trained model is not yet effective, so that the predictions on the target data could be highly biased. As the learning procedure goes on, the α value increases, hence decreasing the importance of the first term while increasing the weight of the second one. This mechanism implements a kind of progressive transfer from the first to the second term during the learning procedure, allowing the underlying classification model to smoothly focus on the specificity of the target SITS pixels via the pseudolabels selected as described in Section III-D.

The hyperparameter β controls the range of the α tradeoff value in order to prevent the latter from reaching extreme values that would completely move the learning process toward the target domain, resulting in a degeneration of the behavior of SpADANN. For this reason, β is supposed to range between 0.5 and 1. Once the current loss L TOT is computed, the network weights Θ F , Θ L , and Θ D are updated by minibatch stochastic gradient descent (line 6). At the end of the training procedure (line 9), the network weights Θ F and Θ L associated with the encoder F and the land-cover classifier L are returned as output of the training process. This set of parameters represents the classification model that will finally be employed to provide the land-cover mapping predictions on the SITS pixels coming from the target domain.
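The progressive transfer mechanism can be illustrated with a short sketch. The text only states that α is linear in e, N e , and β and starts from zero, so the exact form α = β · e / N e used below is an assumption, as are the function names.

```python
def tradeoff_alpha(epoch, n_epochs, beta):
    """Assumed linear schedule: 0 at the first epoch, approaching beta at the end."""
    return beta * epoch / n_epochs


def total_loss(loss_dann, loss_p, alpha):
    """Convex combination of the adversarial term and the pseudolabel term."""
    return (1.0 - alpha) * loss_dann + alpha * loss_p


# With beta = 0.8 and 300 epochs, the pseudolabel term never receives more
# than 80% of the weight, so the source-supervised term is never switched off.
schedule = [tradeoff_alpha(e, 300, 0.8) for e in (0, 150, 300)]
```

Under this assumed schedule, β = 1 would let the pseudolabel term take the whole weight at the end of training, which is consistent with the role of β as a bound on α.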

IV. DATA
The study site covers an area around the town of Koumbia, in the Province of Tuy, Hauts-Bassins region, in the south-west of Burkina Faso. This area has a surface of about 2 338 km 2 and is situated in the subhumid sudanian zone. The surface is covered mainly by natural savannah (herbaceous and shrubby) and forests, interleaved with a large portion of land (around 35%) used for rainfed agricultural production (mostly smallholder farming). The main crops are cereals (maize, sorghum, and millet) and cotton, followed by oleaginous and leguminous. Several temporary watercourses constitute the hydrographic network around the city of Koumbia.

A. Satellite Image Time Series
We collected SITS of Sentinel-2 imagery spanning the years 2018, 2020, and 2021, amounting to a total of 35, 41, and 39 available scenes, respectively. Based on the available acquisitions, we conducted a visual analysis and selected 24 images for each year. Acquisitions were selected so as to obtain a uniform temporal distribution across the three years. The main selection criteria were to filter out images that were visually impacted by cloud coverage and to keep a sufficient number of acquisitions over the rainy (cropping) season, occurring between May and October. Fig. 4 depicts the acquisition dates of the three Sentinel-2 SITS.
All images were provided by the THEIA Pole platform ([Online]. Available: http://theia.cnes.fr) at Level-2A, which consists of atmospherically corrected surface reflectances (cf. MAJA processing chain [52]) and the related cloud/shadow masks. Only the 10-m spatial resolution bands (Blue, Green, Red, and Near-infrared) were considered in this analysis in order to limit the computational burden of the experimental assessment. A standard preprocessing was performed over each band to replace cloudy pixel values, as detected by the available cloud masks, based on the method proposed in [53]. In this preprocessing, the value of a cloudy pixel (according to the cloud/shadow mask, with a threshold of 0) is linearly interpolated from the preceding and following acquisitions.
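The temporal gap filling can be sketched for a single pixel and a single band as follows. This is a minimal illustration and not the implementation of [53]: the function name is hypothetical, and copying the nearest cloud-free value when a cloudy date has no cloud-free neighbor on one side is an assumption.

```python
def fill_cloudy(values, days, cloudy):
    """Linearly interpolate cloudy dates from the nearest cloud-free dates.

    values: reflectance per acquisition, days: day of year per acquisition,
    cloudy: True where the cloud/shadow mask flags the pixel.
    """
    clear = [i for i, flag in enumerate(cloudy) if not flag]
    if not clear:
        raise ValueError("no cloud-free acquisition for this pixel")
    filled = list(values)
    for i, flag in enumerate(cloudy):
        if not flag:
            continue
        prev = max((j for j in clear if j < i), default=None)
        nxt = min((j for j in clear if j > i), default=None)
        if prev is None:  # cloudy at the start: copy next clear value
            filled[i] = values[nxt]
        elif nxt is None:  # cloudy at the end: copy previous clear value
            filled[i] = values[prev]
        else:
            w = (days[i] - days[prev]) / (days[nxt] - days[prev])
            filled[i] = (1.0 - w) * values[prev] + w * values[nxt]
    return filled


filled = fill_cloudy([0.2, 9.9, 0.6, 9.9], [0, 10, 20, 30],
                     [False, True, False, True])
```

Here the second date is replaced by the midpoint of its two neighbors, and the last date, which has no later cloud-free acquisition, takes the previous clear value.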

B. GT Data
GT data for 2018, 2020, and 2021 have been derived from a large agricultural land-cover dataset available online [54], mainly consisting of field data collected by local experts on several sites all over the tropics. For the Koumbia site, these field surveys were conducted yearly around the growing peak of the cropping season from 2013 to 2021. GPS waypoints were gathered following an opportunistic sampling approach along the roads or tracks according to their accessibility while ensuring the best representativity of the existing cropping practices in place. Records were also provided on different types of noncrop classes (e.g., natural vegetation, settlement areas, and water bodies) to allow differentiating crop and noncrop classes. Moreover, some additional noncrop reference polygons are also provided, obtained by photointerpretation of VHR (SPOT 6/7 and PLEIADES) optical satellite images.
Our final GT has been assembled in a geographic information system vector file, containing a collection of polygons, each attributed with a land-cover category based on information reported in the original database. Statistics about the yearly reference datasets used here are reported in Table I. In order to ensure consistency with the proposed method, we kept the exact same surface for the three reference years by performing a year-by-year intersection of the polygons of the original database.
This also allows measuring the changes occurring in the GT from one year to another, which are obviously larger for crop classes due to the cropping cycles in this type of agricultural system. It is important to note that, for year 2018, the surface of cotton is about twice that of oleaginous/leguminous crops, whereas the two surfaces are balanced for years 2020 and 2021. Fig. 5 quantifies these changes in terms of land-cover classes between each pair of reference years. These changes indeed highlight that cereals, cotton, and oleaginous/leguminous crops occur on the same agricultural parcels over the years. Conversely, the bare soil/built-up and water classes remain unchanged, and few changes occur on noncrop classes, mainly due to occasional shifts in the density of natural vegetation or conversion to active cropland (e.g., 16% of the 2018 grassland either became shrubland, 10%, or was converted to cereal crops, 6%). Fig. 6 shows the normalized difference vegetation index (NDVI) profiles over 2018, 2020, and 2021 for the three agricultural classes: cereals, cotton, and oleaginous/leguminous. We can observe that the profiles are similar over the years. For the cereals and cotton classes, the 2018 profile reaches its peak about ten days earlier than the 2020 and 2021 profiles.
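For reference, the NDVI is computed from the red and near-infrared reflectances as (NIR − Red)/(NIR + Red), and a class profile is obtained by averaging it over the labeled pixels at each date. The sketch below illustrates this computation; the function names are hypothetical and returning 0 when both reflectances are zero is a convention assumed for the example.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index of one observation."""
    denom = nir + red
    return 0.0 if denom == 0 else (nir - red) / denom


def ndvi_profile(nir_series, red_series):
    """Mean NDVI per date; each series is indexed as [pixel][date]."""
    n_pixels, n_dates = len(nir_series), len(nir_series[0])
    return [
        sum(ndvi(nir_series[p][t], red_series[p][t]) for p in range(n_pixels))
        / n_pixels
        for t in range(n_dates)
    ]


# Two pixels observed at two dates.
profile = ndvi_profile([[0.5, 0.3], [0.5, 0.1]], [[0.1, 0.1], [0.1, 0.1]])
```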
Finally, Table II reports the average (and standard deviation) percentage of dates on which a labeled pixel is covered by clouds, both over the whole time series and over the portion of the time series covering the growing and harvesting stages (between days 150 and 300). This latter period is of particular interest to distinguish between the crop classes. Inspecting Table II, we can see that the 2018 SITS data exhibit the highest percentage of cloudiness when statistics are computed over the whole time series of 24 dates. However, when only the period covering the growing and harvesting stages is considered, we can clearly observe that both the 2018 and 2021 SITS data are more affected by cloud coverage than the SITS data from 2020. This diversity in cloudiness is probably due to the differences in the environmental and climatic conditions that affected the study area in the three considered periods. More precisely, the three considered years are affected by nonhomogeneous weather conditions that result in a heterogeneous number of cloud-free observations per time series.
These preliminary analyses clearly indicate the presence of interannual differences in environmental, weather, or climate conditions that can challenge the "naive" transfer of supervised machine learning models from one year to another.

V. EXPERIMENTS
In this section, we describe and discuss the experimental results obtained on the study site introduced in Section IV. We carried out several experiments with the aim to provide an extensive analysis of the performance of SpADANN. We investigate different aspects: we perform an in-depth analysis of the performance of SpADANN with respect to competing methods; and we provide a qualitative evaluation through the visualization of the internal representation learnt by our framework and the exploration of the LCMs.

A. Competing Methods
With the aim to compare the performance of SpADANN to state-of-the-art UDA strategies, we consider the following competitors.
1) The GFK approach introduced in [38]: This approach leverages a kernel-based method that projects both source and target information in a low-dimensional manifold. Since GFK only provides the low-dimensional data projection, a standard supervised model needs to be successively trained to perform the final classification on target data. To this end, we couple the GFK with a multilayer perceptron as well as a RF classifier. We indicate the former approach with GFK-MLP and the latter with GFK-RF.
2) The ADDA method proposed in [39]: This approach employs adversarial learning with a two-player game (discriminator and generator) in order to learn invariant representations w.r.t. the domain. Due to the fact that we are dealing with multivariate time-series analysis, we use as backbone the TempCNN network [3] that was especially designed to perform land-cover mapping from SITS data.
3) The DANN method originally introduced in [40]: This is a standard UDA approach that exploits GRL in order to obtain data representations that are invariant to the particular domain they come from. Also, in this case, we use as backbone model the TempCNN network. This competitor can indeed be considered as an ablation of our proposed framework.
4) The conditional adversarial domain adaptation with entropy conditioning (CDAN+E) approach proposed in [41]: This approach upgrades DANN by conditioning the domain discriminator on the classification output and minimizing an entropy loss on target data. We use as backbone the TempCNN network.
5) The MDD method introduced in [42]: This theory-inspired technique is designed to measure the distribution discrepancy in domain adaptation and it is built on top of the DANN approach. We use as backbone the TempCNN network.
6) The FixMatch method proposed in [35]: This competitor is a state-of-the-art semi-supervised learning approach that exploits consistency regularization between a weak and a strong augmentation of the unlabeled data. We include this competitor in order to further highlight the need for domain adaptation in the temporal transfer task. To adapt FixMatch to time-series data, we follow what is proposed in [11] where identity function (resp. random time steps selection) corresponds to weak (resp. strong) data augmentation. We use as backbone the TempCNN network.
7) The adversarial-learned loss for domain adaptation (ALDA) method presented in [43]: ALDA combines self-training and domain-adversarial learning to reduce the gap and align the feature distributions by means of a noise-correcting domain discriminator. We use as backbone the TempCNN network.

Moreover, we also consider three baseline strategies: a supervised classification model trained with only source data and directly deployed on target data, referred to as "only D s "; a supervised classification model trained on labeled target data and deployed on the rest of the target examples, referred to as "only D t "; and a supervised classification model trained on the union of the source data and a portion of the target data and deployed on the rest of the target examples, referred to as "D s + D t ." The first constitutes a straightforward baseline that does not take into account the necessity to deal with temporal distribution shifts. The second represents the performances we can (theoretically) achieve if we have knowledge about the labels associated to the target domain D t . The third provides a baseline that directly combines all the source domain data with some labeled samples from the target domain in order to assess the possibility to combine data from different domains.
For all the baseline strategies, we consider, as supervised classification methods, both the RF and the TempCNN [3] models. These two models are chosen due to the fact that they are standard and widely adopted methodologies for land-cover mapping from SITS data. More in detail, the former has an established popularity in the remote sensing community due to the accuracy of its classifications [55], while the latter approach is representative of the recent deep learning methods that explicitly manage the temporal dimension that heavily characterizes SITS data [56].

B. Experimental Settings
1) Evaluation of Baseline Strategies: Concerning the first baseline strategy (only D s ), the supervised classification model is trained over all the source data and then deployed on the target data. Regarding the second baseline strategy (only D t ), solely the target data are exploited. In more detail, target data are split into three parts (training, validation, and test sets) following a proportion of 70%, 10%, and 20% of the original target dataset, respectively. Furthermore, with the aim to avoid possible spatial bias in the evaluation procedure [57], we impose that all the pixels belonging to the same object are exclusively associated to one of the data partitions (training, validation, or test). The splitting procedure is repeated ten times and the average results are reported. Concerning the third baseline strategy (D s + D t ), the complete set of data from the source domain is combined with 80% of the target data. This amount of target samples corresponds to the union of the training and validation sets of the second baseline strategy (only D t ). Successively, the learnt classifier is deployed on the rest of the target data. The procedure is repeated ten times (fixing the source data and varying the selected labeled target data) and the average results are reported.
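The object-wise splitting constraint can be sketched as follows. This is a hedged illustration with hypothetical names: each pixel carries the identifier of the GT polygon (object) it belongs to, and the 70/10/20 proportions are applied to the list of objects, which only approximates the proportions in terms of pixels when objects have different sizes.

```python
import random


def split_by_object(object_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign whole objects to train/val/test so that no object is shared."""
    objects = sorted(set(object_ids))
    random.Random(seed).shuffle(objects)
    n_train = round(fractions[0] * len(objects))
    n_val = round(fractions[1] * len(objects))
    partition_of = {}
    for rank, obj in enumerate(objects):
        if rank < n_train:
            partition_of[obj] = "train"
        elif rank < n_train + n_val:
            partition_of[obj] = "val"
        else:
            partition_of[obj] = "test"
    split = {"train": [], "val": [], "test": []}
    for pixel_index, obj in enumerate(object_ids):
        split[partition_of[obj]].append(pixel_index)
    return split


# 10 objects of 5 pixels each; changing the seed gives another repetition.
ids = [k // 5 for k in range(50)]
split = split_by_object(ids, seed=0)
```

Repeating the call with ten different seeds and averaging the scores would mimic the repeated evaluation described above.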
2) Evaluation of UDA Competing Methods: All the UDA models are trained exploiting the whole set of source and target samples with the sole access to label information coming from the source domain.
Concerning the evaluation tasks, according to the data presented in Section IV, we set up three temporal transfer tasks (D s → D t ) where the right arrow indicates the transfer direction from the source (D s ) to the target (D t ) domain: (2018 → 2020), (2018 → 2021), and (2020 → 2021).
To evaluate the different methodologies, once the models are trained, we consider two different evaluation scenarios for each of the three tasks. Regarding the evaluation scenarios, we distinguish between the following.
1) We use the same test set as the one employed for the second baseline strategy (only D t ). Following this evaluation, we compare all the approaches to each other. We name this context Subset D t .
2) We use the whole target data D t as test set (this is possible for all the baseline and UDA methods except for the second baseline strategy). We name this context Full D t .
The values of the three SITS benchmarks (2018, 2020, and 2021) were normalized per band in the interval [0,1] with the min-max method. The assessment of the model performances was done considering the following metrics: accuracy (global precision), weighted F1-score, and Cohen's Kappa (level of agreement between two raters relative to chance).
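The three metrics can all be derived from the confusion matrix, as the following sketch shows. It is a plain-Python illustration with hypothetical function names; rows of the matrix are true classes and columns are predicted classes, and a class that is neither present nor predicted contributes an F1-score of 0 by convention.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def accuracy(cm):
    total = sum(map(sum, cm))
    return sum(cm[k][k] for k in range(len(cm))) / total


def weighted_f1(cm):
    """Per-class F1 averaged with weights equal to the class support."""
    n, total, score = len(cm), sum(map(sum, cm)), 0.0
    for k in range(n):
        support = sum(cm[k])                       # TP + FN
        predicted = sum(cm[r][k] for r in range(n))  # TP + FP
        denom = support + predicted
        f1 = 2.0 * cm[k][k] / denom if denom else 0.0
        score += support / total * f1
    return score


def cohen_kappa(cm):
    """Observed agreement corrected by the agreement expected by chance."""
    n, total = len(cm), sum(map(sum, cm))
    p_o = accuracy(cm)
    p_e = sum(
        sum(cm[k]) * sum(cm[r][k] for r in range(n)) for k in range(n)
    ) / total ** 2
    return (p_o - p_e) / (1.0 - p_e)


cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```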
3) Implementation Details: For the NN approaches, the training stage has been conducted for 300 epochs, with a learning rate of 10 −4 and a batch size of 32. Batch normalization layers have been inserted after each fully connected or convolutional layer (except for the classification layer). The dropout rate is set to 50%. For SpADANN, we set the value of β equal to 0.8 and the value of the hyperparameter λ as suggested by Ganin et al. [40]. In addition, similarly to [11], SpADANN implements domain-specific batch normalization [58] by processing the source and target minibatches separately. This ensures that batch normalization [59] statistics are computed separately for each domain.
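For readers unfamiliar with the setting of λ, the sketch below reproduces the schedule proposed in the DANN paper, where the weight of the gradient reversal layer grows smoothly from 0 to 1 with the training progress. We assume here that "as suggested by Ganin et al." refers to this schedule with its default γ = 10; the function name is hypothetical.

```python
import math


def grl_lambda(progress, gamma=10.0):
    """DANN schedule: lambda = 2 / (1 + exp(-gamma * p)) - 1, with p in [0, 1]."""
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


# Training progress expressed as epoch / number of epochs.
lambdas = [grl_lambda(e / 300) for e in (0, 30, 150, 300)]
```

The domain-adversarial signal is therefore almost absent in the first epochs, when the domain discriminator is still noisy, and quickly saturates close to 1 afterward.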
Considering RF classifiers, we optimize the model via the tuning of one parameter: the number of trees in the forest. We vary this parameter in the range {100, 200, 300, 400, 500}. The multilayer perceptron classifier coupled with the GFK approach has two fully connected layers of 512 neurons each, both with ReLU activation function and followed by batch normalization and dropout layers. A final output layer, with softmax activation function, is employed to perform classification.
Regarding ALDA and according to the recent literature on pseudolabeling in the context of SITS-based land-cover mapping [11], we set the threshold of pseudolabels to 0.9. The same pseudolabel threshold is also used for FixMatch. For this latter method, we set the relative weight of the unlabeled loss (λ u ) to 2, and the strong data augmentation is implemented by means of the Python TSAUG library 2 via the "dropout" function with parameter p = 0.05 (probability to drop timestamps).
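To make the strong augmentation concrete, the following sketch emulates a random dropping of timestamps in plain Python. It does not call the TSAUG library and is not meant to reproduce its exact behavior: the function name is hypothetical, and both the forward filling of dropped timestamps and the fact that the first timestamp is always kept are assumptions.

```python
import random


def drop_timestamps(series, p=0.05, seed=None):
    """Randomly drop timestamps of a multivariate series indexed as [date][band].

    A dropped timestamp is replaced by the previous (possibly filled) one.
    """
    rng = random.Random(seed)
    augmented = [list(series[0])]
    for t in range(1, len(series)):
        if rng.random() < p:
            augmented.append(list(augmented[-1]))  # dropped: forward fill
        else:
            augmented.append(list(series[t]))      # kept unchanged
    return augmented


# 24 dates, 4 bands (toy values).
series = [[0.01 * t, 0.02 * t, 0.03 * t, 0.04 * t] for t in range(24)]
strong_view = drop_timestamps(series, p=0.05, seed=42)
weak_view = [list(step) for step in series]  # identity as weak augmentation
```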
Experiments are carried out on a workstation with a dual Intel (R) Xeon (R) CPU E5-2667v4 (@3.20 GHz) with 256 GB of RAM and four TITAN X (Pascal) GPUs. All the deep learning methods are implemented using the Python TensorFlow library except ALDA, which was implemented in PyTorch based on the original open-source implementation. 3 The MDD and CDAN+E competitors are implemented via the Python ADAPT library [60]. All the models run on a single GPU. The RF is implemented using the Python Scikit-learn library. The code implementation of SpADANN is available at this link. 4

The results, in terms of F1-score, Accuracy, and Kappa, are reported in Tables III, IV,

First, we can notice that, whatever the transfer task is, a direct application of a model learnt on the source domain to data coming from the target domain results in poor performances, as expected. This is evident when we compare, for each table, the metric values achieved by the Only D s and Only D t strategies. We can also observe that supervised learning models trained under the Only D t and D s + D t strategies achieve very similar performances, thus providing empirical evidence that, when the training set contains samples coming from different distributions, increasing the amount of training data does not result in an increase of classification performances. All these points clearly indicate that a serious distribution shift exists between two years of SITS data on the considered study site. The highest gap is shown by the task (2018 → 2021), where the performances of the best supervised machine learning method (RF) degrade by around 18 points of F1-score.
Second, we can see that, generally, UDA strategies allow to reduce the performances gap induced by the distribution shifts between the source (D s ) and the target (D t ) domains. While this improvement is evident for the deep learning based techniques, it is not so explicit for the GFK approach. This is probably due to the fact that the GFK approach aligns domains independently from the underlying classification task, while all the other approaches perform an end-to-end process that optimizes together the data distribution alignment and the classification process.
Third, we can observe that SpADANN always obtains the best scores among the UDA methods. The gains compared with DANN and CDAN+E, which are the two most competitive approaches for all the transfer tasks, vary between 1.5 and 8.5 points of F1-score. Regarding the transfer task (2018 → 2020), the gap induced by the data distribution shift is largely reduced, especially in terms of accuracy and Kappa score.
Concerning the other two transfer tasks (2018 → 2021) and (2020 → 2021), SpADANN outperforms the supervised classifier approaches when they are trained on the target data (D t ). This unexpected result is tightly related to several factors associated with the SITS data covering the year 2021, which describes the target domain (D t ) in both transfer tasks. More precisely, as highlighted by the statistics reported in Table II, the cloud cover associated with the 2021 SITS data (regarding the GT pixels) is quite high and it affects periods of the year (growing and harvesting stages) that are crucial for the monitoring of the agricultural classes involved in the study site. Due to the gap-filling process we used to obtain complete time-series data, the standard supervised machine learning methods could be biased by such synthetic information, relying on the gap-filled values to derive their decision boundary. Conversely, SpADANN is based on a domain alignment process, implemented via adversarial learning, that forces the whole pipeline to extract invariant features w.r.t. the two domains (D s and D t ). Such a quest for invariant characteristics allows our framework to focus on common information, hence reasonably discarding specific per-year information that can be related to local (in terms of domain) behaviors or artifacts.
Finally, we can observe that, in all transfer tasks, the performances obtained on the Full D t scenario are similar to those obtained by evaluating the method on subsets of the target domain (Subset D t ). This fact indicates that the test subsets extracted from the whole target domain are well representative of the whole target distribution.
1) Per-Class Analysis: In this section, we report and discuss per-class analysis regarding the competing methods on the three transfer tasks we have considered. We first report per-class F1-score and, successively, we examine the different confusion matrices to understand possible interclass mistakes. For this analysis, we focus our attention on the supervised methods (Only We can clearly note that our framework achieves superior performances on the majority of the land-cover classes (grassland, shrubland, forest, bare soil/built-up, and water). Such land-cover classes show a more stable pattern among the years, hence exhibiting a smaller distribution shift to fill between the source and target domains. Interestingly, on such classes, SpADANN always achieves better performances than a supervised machine learning model directly learnt on the considered target domain. Concerning the remaining classes (cereals, cotton, and oleaginous), a different pattern is exhibited. When the year 2018 is considered as source domain (D s ), the transfer on the agricultural classes struggles to achieve performances on par with the supervised methods trained on the target domain, with the (2018 → 2021) task showing better transferability behavior, using SpADANN, than the (2018 → 2020) one. Another interesting point is related to the poor performances that all the UDA methods exhibit for the oleaginous class. This is probably due to the fact that, between 2018 and the other two subsequent years, the distribution of the GT data on such class drastically changes (see Table I). More precisely, in 2018, the oleaginous class covers an area of 350 000 m², while in 2020 and 2021, this surface doubles, reaching more than 730 000 m². This indicates that, in 2018, the oleaginous surfaces are under-represented w.r.t. the other land-cover classes, thus producing a highly imbalanced dataset.
As a matter of fact, this shift in agricultural practice significantly affects the capacity of all the models to generalize on the oleaginous land-cover class when 2018 is considered as the source domain. In addition, due to the pseudolabeling procedure associated with our framework, if a class in the source domain is highly under-represented, the same class in the target domain will inherit this feature, with all the possible issues related to learning classification models under imbalanced scenarios.
Regarding the (2020 → 2021) transfer task, here SpADANN effectively shows transfer capabilities also on the agricultural classes. The same behavior can be observed for all the other competing methods. These results can be explained, also in this case, by the fact that all the land-cover classes are sufficiently represented in the source domain with a more balanced representation, while class distributions in the source and target domain are more similar to each other (see Table I).
Figs. 10, 11, and 12 depict confusion matrices for the (2018 → 2020), (2018 → 2021), and (2020 → 2021) transfer tasks, respectively. Globally, the confusion matrices confirm the trend observed in the per-class F1-score analysis. All the methods have some trouble discriminating among the different agricultural classes. As discussed before in Section V-C1, the UDA approaches suffer from class imbalance in transfer tasks such as (2018 → 2020) and (2018 → 2021). We can also note that some other coherent confusions arise between the grassland and shrubland, and the shrubland and forest land-cover classes. This is expected since these three classes refer to three different degrees of density of woody vegetation in natural areas, which vary in a continuous way over the site, making a neat discrimination more challenging.
Despite this fact, SpADANN provides a more visible diagonal structure (the dark red blocks concentrated on the diagonal) than the second best competing UDA method alleviating some of the major confusions exhibited by the competing approaches. Same goes for the (2020 → 2021) transfer task, where SpADANN clearly outperforms the supervised approaches trained on the target domain.
2) Running Time of the UDA Competing Methods: Table VI summarizes the training time of the different UDA methods involved in the experimental evaluation. Apart from the GFK+RF strategy, which requires only a few minutes to learn its classification model, all the other methods require between 2 and 15 h, with SpADANN demanding around 7.5 h to learn its internal parameters. Since, in our situation, an LULC classification model needs to be trained only once per season (or year), all the reported times remain more than reasonable with respect to the constraints associated with the downstream task.

3) Ablation Analysis: In this section, we disentangle the added value of the different components on which SpADANN relies. Table VII summarizes the behavior of the different ablations of SpADANN for the (2018 → 2020) task. In particular, we make reference to three specific ablations of SpADANN.
1) SpADANN noST , an ablation of the proposed method without the self-training step in the overall training stage. In this ablation, for each source pixel in a batch, the corresponding target pixel (in terms of spatial location) is present in the same batch. In this way, the adversarial learning stage is constrained to extract domain-invariant features for spatially corresponding source and target pixels, contrary to what is done during the training of the DANN method, where source and target pixels in a batch are selected completely at random. Here, L TOT = L DANN .
2) SpADANN T h , an ablation of the proposed method where the selection of pseudolabeled samples is achieved by the traditional thresholding approach [61]. More precisely, during the iterative process, samples from the target domain are associated to a pseudolabel if the most confident class predicted by the land-cover classifier has an associated confidence bigger than a specific threshold θ. We set θ equal to 0.9, similarly to what is done for the ALDA and FixMatch approaches.
3) SpADANN onlyC1 , an ablation of the proposed method that removes the pseudolabel condition requiring that the predicted class is equal to the true class (Cl(x s i ) = y s i ) for the self-training stage. Here, the L p loss is redefined as follows:

$$L_p = \sum_{x_i^t \in X_t} \mathbb{1}_{Cl(x_i^s) = Cl(x_i^t)} \cdot H\big(\hat{y}_i^t, Cl_{prob}(x_i^t)\big)$$

We can first note that SpADANN behaves far better than DANN, which can be seen as a baseline ablation of our framework. Second, we can see that no real difference exists between DANN and SpADANN noST . This underlines that the spatial alignment between source and target training batches, alone, does not provide any added value. Moreover, we observe that choosing pseudolabels based on a traditional thresholding mechanism (SpADANN T h ) or only based on the spatial consistency of the model output classification (SpADANN onlyC1 ) degrades the performances. This is probably due to the fact that the condition (Cl(x s i ) = y s i ), in conjunction with the condition (Cl(x s i ) = Cl(x t i )), filters out spurious information, consequently providing more guarantees on the quality of the pseudolabels selected (from the target domain) to enrich the current training set. Finally, we observe that SpADANN always outperforms all its ablations, underlining that the interplay among the different components on which it is built eventually provides a robust strategy for the tUDA problem from SITS data.
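The three pseudolabel selection rules compared in this ablation can be summarized by the following sketch, which decides whether a single target pixel is retained. The function and rule names are hypothetical and only serve the illustration; "threshold" corresponds to SpADANN T h , "onlyC1" to SpADANN onlyC1 , and "full" to the complete SpADANN selection.

```python
def is_selected(rule, src_pred, tgt_pred, src_label, tgt_conf, theta=0.9):
    """Return True if the target pixel receives a pseudolabel under `rule`.

    src_pred / tgt_pred: classes predicted for the co-located pixels,
    src_label: GT class of the source pixel,
    tgt_conf: highest class probability on the target pixel.
    """
    if rule == "threshold":
        return tgt_conf > theta
    consistent = src_pred == tgt_pred
    if rule == "onlyC1":
        return consistent
    if rule == "full":
        return consistent and src_pred == src_label
    raise ValueError(f"unknown rule: {rule}")


# A confident and spatially consistent prediction that is wrong on the
# source pixel: only the full rule rejects it.
pixel = dict(src_pred=2, tgt_pred=2, src_label=3, tgt_conf=0.95)
decisions = {rule: is_selected(rule, **pixel)
             for rule in ("threshold", "onlyC1", "full")}
```

This toy case shows the kind of spurious pseudolabel that the additional condition on the source prediction is meant to filter out.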
4) Sensitivity to the β Hyperparameter: In this section, we test the sensitivity of SpADANN to the value of the β hyperparameter. Fig. 13 summarizes the behavior of SpADANN, in terms of accuracy, on the three transfer tasks when the value of β varies between 0.5 and 1.0.
For the transfer tasks (2018 → 2021) and (2020 → 2021), we can note that, generally, as the value of the β hyperparameter increases, the performances of our framework increase as well. The only exception is represented by the transfer task (2018 → 2021), in which a value of β equal to 1 (only pseudolabels extracted from the target domain are considered at the end of the training process) degrades the final performances. This is probably due to the fact that, as highlighted by the previous results, a serious distribution shift exists between SITS data coming from 2018 and 2021, so that forcing the learning process to make a complete transfer from the source to the target domain results in a less appropriate classification model. Regarding the transfer task (2018 → 2020), SpADANN exhibits a slightly fluctuating behavior with a variation of less than a point around an accuracy value of 76%.
As an empirical rule, we can state that choosing a value of the β hyperparameter between 0.7 and 0.9 is the most appropriate choice, since this setting can prevent the model from suddenly degenerating due to a complete transition from a source to a target domain characterized by very different data distributions.

D. Visual Analysis
In this part of the experimental evaluation, we conduct some qualitative analysis to further assess the behavior of SpADANN in the case of the transfer task (2018 → 2020). The other transfer tasks are provided as Tables VIII and IX and are evaluated in Appendix A.
More precisely, we first investigate some extracts related to the LCMs provided by SpADANN and some of the competing approaches and, successively, we visually investigate the internal representations learnt by the involved deep learning models.
1) LCMs: In Fig. 14(b)-(e), the maps obtained for the (2018 → 2020) transfer task are compared over the scene subset depicted in Fig. 14(a). The maps shown are, respectively, the one obtained using the RF classifier trained on the target domain D_t, the one obtained through a "naive" transfer (direct transfer, without UDA, of the RF model trained on the source domain D_s), and the two maps obtained through the DANN and SpADANN domain adaptation methods.
In accordance with the quantitative analysis reported in Fig. 10, the visual analysis confirms that transferring knowledge from 2018 to 2020 is a challenging task, probably due to longer term changes in seasonal vegetation dynamics that appear after a 2-year delay, as well as to a redistribution of the proportions among the different crop classes. The main difference concerns the strong underestimation of the oleaginous/leguminous and cotton classes in almost all maps obtained with a transfer approach [see Fig. 14(c)-(e)], to the benefit of the cereal class, with the direct transfer method being particularly destructive. However, while both UDA methods seem to effectively restore the extent of the cotton class, the SpADANN map is clearly less noisy than the DANN map, once again confirming a better potential for recovering spatial structures than its competitor.
To better appreciate the spatial precision of the SpADANN maps w.r.t. its direct competitor, we also report some zoomed-in areas in Figs. 15 and 16. In both cases, spatial details better emerge in the maps provided by SpADANN, both over agricultural fields, with more structured and less noisy plots (in terms of salt and pepper error), especially over the cotton and cereals classes and over natural spaces.
The underestimation of the oleaginous/leguminous and cotton classes can be related to the crop class imbalance that characterizes the 2018 GT data. More precisely, as shown in Section IV, the reference data collected in 2018 for both oleaginous/leguminous and cotton cover, in terms of surface, a much smaller area than those collected for the cereal class. This evident imbalance among crop classes affects both the direct transfer and the domain adaptation strategies, thus propagating the distribution bias of the source domain to the target one.
Such effects can, once again, be better observed in the zoomed-in areas of Figs. 15 and 16. This last observation seems to go in the opposite direction to the quantitative results previously reported on the same land-cover classes. This is probably due to the fact that the GT data that we use to both train and validate the different models only partially represent the study area in terms of land-cover class distributions. This fact underlines that, when the GT data collection is affected by operational constraints associated with costly and labor-intensive field campaigns, inspecting the produced LCMs is advisable to evaluate the behavior of the land-cover classifiers. Relying only on quantitative analysis, via standard classification metrics, can provide a limited understanding of the methods' behavior over the whole study area.
2) Visualization of Internal Feature Representations: In this last stage of our experimental evaluation, we provide a visual inspection of the internal feature representations learned by GFK, ADDA, DANN, and SpADANN on the transfer task (2018 → 2020). To this end, we randomly chose 300 samples per land-cover class from the target domain and extracted the corresponding feature representation for each method. Subsequently, we applied t-SNE [62] to reduce the feature dimensionality for visualization purposes. Results are depicted in Fig. 17. We can note that all the methods separate well the samples of the water and bare soil/built-up classes from the rest of the data. While GFK and ADDA clearly mix samples from all the other land-cover classes together, DANN and SpADANN partially alleviate clutter on the remaining classes, with the latter providing a slightly better cluster structure than the former on the considered subset of target data. This can be noted, for instance, for the grassland, shrubland, and forest classes. Overall, the visualization of the internal feature representations is coherent with the quantitative and qualitative findings previously discussed.
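The per-class sampling step that precedes the t-SNE projection can be sketched as follows. This is an illustrative sketch only: the function name `sample_per_class`, the fixed seed, and the choice to keep a class whole when it has fewer than 300 samples are assumptions that are not stated in the text.

```python
import random
from collections import defaultdict


def sample_per_class(labels, n_per_class=300, seed=0):
    """Return sorted indices with at most n_per_class samples per label."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    selected = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = min(n_per_class, len(indices))  # small classes are kept whole
        selected.extend(rng.sample(indices, k))
    return sorted(selected)
```

The features of the selected target samples, extracted separately for each method, would then be projected to two dimensions, for instance with `sklearn.manifold.TSNE(n_components=2).fit_transform(features)` if scikit-learn is the chosen implementation.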

VI. DISCUSSION
To summarize, our research study proposes a novel framework to perform tUDA for land-cover mapping from SITS data. It couples together adversarial and self-training learning with the aim to cope with the distribution shifts affecting data coming from different years and hindering the transfer of standard machine learning models. In addition, to the best of our literature survey, this is the first time that recent deep learning methods are leveraged in the context of temporal transfer of LULC models from SITS data.
First, we underline that our framework exploits the spatiotemporal information carried by remote sensing data in order to temporally transfer the final LULC classification model. It explicitly leverages spatial information in order to transfer the model from one year (the source domain) to another year (the target domain) via self-training. The spatial alignment facilitates the identification of stable regions that act as tie points between the two domains, while the self-training strategy allows the model to learn from its own predictions. As underlined by the ablation analysis, such components are fundamental for SpADANN to achieve its final goal.
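One plausible reading of how stable regions can act as tie points is sketched below: a target pixel receives a pseudolabel only when the class predicted on the target year agrees with the source-year reference label at the same location. This is a sketch under stated assumptions, not the exact SpADANN selection rule. The function name `select_pseudo_labels`, the dictionary-based data layout, and the optional `min_confidence` threshold are hypothetical.

```python
def select_pseudo_labels(source_labels, target_probs, min_confidence=0.0):
    """Pick target pseudolabels that agree with the co-located source label.

    source_labels: dict mapping a pixel location to its source-year class.
    target_probs:  dict mapping a pixel location to the list of class
                   probabilities predicted on the target year.
    min_confidence: hypothetical extra filter on the predicted probability.
    Returns a dict mapping each retained location to its pseudolabel.
    """
    selected = {}
    for location, probs in target_probs.items():
        if location not in source_labels:
            continue  # no reference label at this location
        predicted = max(range(len(probs)), key=probs.__getitem__)
        agrees = predicted == source_labels[location]
        if agrees and probs[predicted] >= min_confidence:
            selected[location] = predicted
    return selected
```

With this rule, locations whose predicted class changes between the two years are simply left out of the self-training set, so that only spatially consistent samples feed the target loss.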
Second, we have observed that, in general, performance varies from one transfer task to another. This is well known in the general field of domain adaptation since not all transfer tasks are equal [63]. More precisely, in our experimental evaluation, we have noted that class imbalance in the source domain (i.e., D_s = 2018), as well as major changes in class distributions (i.e., in the underlying cropping practices), can negatively influence the transfer from one year to another. These points suggest that SpADANN can be deployed in situations where no dramatic changes in the underlying landscape happen, thus potentially limiting the costs and human effort associated with field campaigns by reducing their frequency, for instance, from annual to every two or three years.
Third, the use of domain adaptation for temporal LULC transfer opens new avenues of investigation for reusing already acquired data (on the same study site), with the aim of increasing the return on investment of past field campaigns and efforts. We have observed that, in some cases, combining two years of SITS data with previously acquired reference data can improve the classification performance on the target domain (i.e., transfer task 2020 → 2021). This is due to the fact that SpADANN is steered to extract representations that are invariant w.r.t. the specific domain, thus alleviating year/domain-specific issues (i.e., complex and unfavorable acquisition conditions). In addition, here, we have focused our attention on the monosource (monoyear) setting, in which only a specific year is used as source domain. In real-world LULC applications, however, reference and satellite data spanning several previous years may be available, thus allowing the process to exploit multiyear information under the lens of unsupervised multisource domain adaptation [12], [64], a recent family of techniques that extends standard UDA to consider multiple (related) source domains with the aim to generalize on the unlabeled target domain.
Fourth, connections between recent spatial [11], [47] and tUDA approaches for SITS land-cover mapping can be drawn. Both families of methods aim to cope with possible distribution shifts between source and target SITS data, which can be caused by different environmental, weather, or climate conditions of acquisition. In our framework, in order to address the tUDA scenario, the spatial alignment between source and target data is explicitly exploited to alleviate distribution shifts, while this characteristic cannot be leveraged in the context of spatial UDA since the two domains are spatially unrelated. This fact prevents the use of SpADANN, as it is, for the spatial UDA scenario, while the contrary should be possible. Nevertheless, as shown in Section V-C3, the information derived from spatial alignment constitutes a crucial asset that effectively guides the self-training process. For this reason, methods that do not integrate such knowledge will probably fail to provide an effective solution for the tUDA scenario.
Finally, we recall that our task is characterized by operational/realistic constraints, implying a limited and sparse amount of reference data from which the relationships between remote sensing data and the fine land-cover classes are learnt. This is why, in our case study, the TempCNN deep learning approach does not exhibit competitive behavior compared with standard machine learning techniques, such as RF, regarding intradomain classification. Conversely, the simultaneous use of both source and target domains, together with the self-training strategy we have proposed, makes it possible to increase the amount of labeled data the model can access to learn its internal classification function. This means that the proposed framework can be deployed in situations characterized by a moderate availability of labeled data, thanks to its capacity to progressively and incrementally exploit knowledge coming from the two domains in a complementary manner.
The conducted research opens the way to several directions for future work. As of now, SpADANN works in a standard (monosource) UDA setting where only a single labeled source domain is considered. Since previous field campaigns can span multiple years, a possible research direction is the extension of SpADANN to a multisource UDA setting, where multiple labeled source domains can be exploited in order to further improve temporal transferability. Another possible follow-up is to extend our framework to a multimodal scenario where the study area is described by multisensor remote sensing data, such as, for instance, SITS coming from both synthetic aperture radar and optical sensors (e.g., Sentinel-1 and Sentinel-2). While the majority of UDA approaches consider a monomodality setting where domains are described by only one modality, very few research studies exist on UDA under the multimodality scenario, even in the general fields of computer vision and signal processing.

VII. CONCLUSION
In this work, we have presented SpADANN, a new framework to cope with tUDA for land-cover mapping from SITS data. Our approach combines adversarial learning and self-training with the aim to progressively transfer/adapt an NN model from a source domain (a specific year for which GT data are available) to a target domain (a subsequent year for which no label information is available) in order to provide land-cover classification on the latter. While the adversarial learning strategy is implemented by means of GRL, in order to extract domain-invariant features, the self-training stage selects pseudolabels on the target domain by leveraging spatial consistency between domains.
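As a reminder of what the GRL does, the following framework-free sketch assumes that GRL denotes the gradient reversal layer used in DANN-style training: the layer is the identity in the forward pass and flips the sign of the gradient in the backward pass. The class name `GradientReversal`, the `scale` parameter, and the list-based interface are illustrative assumptions. An actual implementation would rely on the automatic differentiation machinery of a deep learning library.

```python
class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, scale: float = 1.0):
        self.scale = scale  # strength of the reversal (hypothetical name)

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_from_domain_classifier):
        # The feature extractor receives the opposite gradient, so a descent
        # step on its weights increases the domain classification loss.
        return [-self.scale * g for g in grad_from_domain_classifier]
```

Because the domain classifier still minimizes its own loss while the feature extractor is pushed in the opposite direction, the features progressively become harder to assign to the source or the target year, which is the domain-invariance property mentioned above.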
The results obtained on the Koumbia study site have highlighted the quality of our framework, in both quantitative and qualitative analyses, with respect to all the UDA competitors. This is tightly related to the fact that SpADANN explicitly takes advantage of the spatiotemporal features that characterize SITS data. In addition, we have also shown that, when the general land-cover distribution does not exhibit drastic changes between the source and target domains, the proposed method is highly competitive compared with a model directly trained on the target domain. This last point can be explained by the fact that our framework focuses its attention on domain-invariant characteristics, thus probably discarding year-specific information that can be related to local (in terms of domain) behaviors or artifacts, and that it leverages more training data thanks to the self-training strategy we have proposed.

A. Ablation Analysis for the (2018 → 2021) and the (2020 → 2021) Transfer Tasks
The results obtained for the (2018 → 2021) and (2020 → 2021) transfer tasks, detailed in Tables VIII and IX, confirm the findings presented for the (2018 → 2020) transfer task in Section V-C3.

B. LCMs for the (2018 → 2021) and the (2020 → 2021) Transfer Tasks
As with (2018 → 2020), the challenging transfer task (2018 → 2021) is affected by longer term changes in seasonal vegetation dynamics that appear after a 3-year delay and by a redistribution of the proportions among the different crop classes. For this reason, the visual analysis of Figs. 18, 19, and 20 leads to the same findings as for the transfer task (2018 → 2020).
In the case of task (2020 → 2021) (see Fig. 21), when the direct transfer strategy (ONLY D_s) is used with RF, the resulting map mainly shows a significant reduction of the surface covered by the cotton and cereals classes with respect to the baseline, to the benefit of the oleaginous/leguminous class. Increased noise in the natural vegetation areas is also present. When a UDA approach is used, things improve markedly: using DANN, the extent of the cotton class is mostly restored, but some disproportion remains between the cereals and oleaginous/leguminous classes, whose discrimination is more challenging. Finally, SpADANN seems to visually provide the best results, with an almost completely restored balance among crop classes, as well as improved details over natural vegetation areas. To better appreciate the spatial precision of the SpADANN maps w.r.t. its direct competitor, we also report some zoomed-in areas in Figs. 22 and 23. In both cases, spatial details better emerge in the maps provided by SpADANN, both over agricultural fields, with more structured and less noisy plots (in terms of salt and pepper error), especially over the cotton and cereals classes, and over natural spaces.

C. Visualization of Internal Feature Representations for the (2018 → 2021) and the (2020 → 2021) Transfer Tasks
The observations made in Section V-D2 also apply to the t-SNE visualizations for the transfer tasks (2018 → 2021) and (2020 → 2021), which are shown in Figs. 24 and 25, respectively.