Journals & Magazines >IEEE Journal of Selected Topi... >Volume: 18

A Deep Learning Architecture for Land Cover Mapping Using Spatio-Temporal Sentinel-1 Features

Abstract:

Land cover (LC) mapping using satellite imagery is critical for environmental monitoring and management. Deep learning (DL), particularly convolutional neural networks (C...Show More

Topic: Satellite Remote Sensing for effective Natural Resource management

Metadata

Abstract:

Land cover (LC) mapping using satellite imagery is critical for environmental monitoring and management. Deep learning (DL), particularly convolutional neural networks (CNNs) and vision transformers (ViTs), have revolutionized this field by enhancing the accuracy of classification tasks. In this work, a novel approach combining a transformer-based Swin-Unet architecture with seasonal synthesized spatio-temporal images has been employed to classify LC types using spatio-temporal features extracted from Sentinel-1 (S1) synthetic aperture radar (SAR) data, organized into seasonal clusters. The study focuses on three distinct regions—Amazonia, Africa, and Siberia—and evaluates the model performance across diverse ecoregions within these areas. By utilizing seasonal feature sequences instead of dense temporal sequences, notable performance improvements have been achieved, especially in regions with temporal data gaps such as Siberia, where S1 data distribution is uneven and nonuniform. The results demonstrate the effectiveness and the generalization capabilities of the proposed methodology in achieving high overall accuracy (OA) values, even in regions with limited training data.

Topic: Satellite Remote Sensing for effective Natural Resource management

Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing ( Volume: 18)

Page(s): 10562 - 10581

Date of Publication: 03 April 2025

ISSN Information:

DOI: 10.1109/JSTARS.2025.3557687

Funding Agency:

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Land cover (LC) mapping has emerged as a critical tool for a diverse range of applications including forest monitoring, agriculture, urbanization, flood monitoring, and climate change analysis. Accurate LC maps are essential for shaping effective land use policies. The LC maps are extremely beneficial for evaluating ecosystem health. They allow users to monitor conditions across different regions worldwide, aiding in more informed decisions for effective ecosystem management. In addition, these maps help track environmental health and identify the impacts of ongoing changes.

The advent of free optical and SAR datasets, such as those from the Sentinel constellation, combined with cloud processing platforms such as Google Earth Engine [1] and the Copernicus Data and Information Access Services platforms, has greatly facilitated LC mapping on a broad scale.

Over the years, a variety of methodologies have been explored for LC mapping. Early approaches primarily utilized optical imagery, as seen in works that leveraged the Landsat archive for Eastern Europe and Romania [2], [3]. More recent studies have begun incorporating SAR data, often in combination with advanced machine learning (ML) techniques, to enhance the accuracy and detail of LC maps [4], [5], [6], [7]. In the latter, for instance, a ML-based approach using a random forest (RF) classifier on multitemporal SAR Sentinel-1 (S1) data to classify vegetation LC types is employed, highlighting the ongoing evolution in methodology and application scope.

The integration of DL methodologies, particularly convolutional neural networks (CNNs), has marked a significant advancement in the field of image processing, including remote sensing for LC mapping. CNNs are exceptionally adept at extracting local spatial patterns directly from raw inputs through their convolutional layers, thereby learning enhanced feature representations [8], [9], [10], [11]. Their encoder-decoder architectures, exemplified by models such as U-Net [12], efficiently map satellite imagery into detailed segmentation maps while preserving spatial details through skip connections. However, the emergence of vision transformers (ViTs) has introduced a new paradigm in understanding global dependencies within images. Unlike CNNs, ViTs utilize multihead attention mechanisms to capture long-range contextual relationships, offering a distinct advantage in scenarios where global contextual understanding is crucial [13]. In the specific context of SAR data for LC mapping, several DL architectures have been tailored to address the unique challenges presented by this type of data. CNNs remain a popular choice, as evidenced by their extensive use in various studies. For example, Kussul et al.[14] used a CNN architecture applied in a heterogeneous environment for crop classification in the Kyiv region of Ukraine, using time series acquired by Landsat-8 and S1A, achieving an overall accuracy (OA) of about 94% . Fontanelli et al.[15] evaluated the performance of COSMO-SkyMed X-band dual-polarized data in the test area in Ponte a Elsa (Tuscany, Central Italy) in January–September 2020 and 2021. In this case, a CNN-based classifier was arranged, trained, and used for the LC mapping, and an OA above 90% was marked.

Other innovative approaches include the integration of recurrent neural networks (RNNs) to better handle the temporal dynamics of multitemporal SAR data, enhancing classification accuracy significantly over traditional ML methods [16].

In addition, hybrid models that combine CNNs with RNNs, such as the fully convolutional network (FCN) combined with convolutional long short-term memory, have demonstrated their effectiveness in extracting both spatial and temporal features from SAR data [17]. Such integrative approaches exemplify the ongoing innovation in DL techniques tailored for enhanced LC mapping using SAR data.

In summary, most traditional high-resolution (HR) LC maps are generated using either optical data alone or a combination of SAR and optical data, and are predominantly focused on limited regional areas [18], [19], [20] while only a few provide global-scale coverage, relying solely on optical sources [21]. In contrast, medium-resolution maps, with spatial resolutions of 300 m [22] or 100 m [20], offer broader global coverage. The exclusive use of SAR data, combined with advanced DL techniques such as CNN and ViT, has enabled more comprehensive and detailed analyses of LC on a global scale. This evolution underscores the dynamic nature of remote sensing technologies and their growing importance in global environmental monitoring and policy-making. Building upon the significant progress made in the application of DL and CNNs for LC mapping using SAR data, the approach delineated in this work uniquely advances the field through the strategic utilization of synthesized spatio-temporal images, also called features, on large scale. The features are synthetic images which well exploit the SAR time series to extract the temporal and spatial information concerning the observed “real world.” These synthetic images, the features, are obtained by pure computation, i.e., by modeling the real world through the application of the temporal and spatial filters.

The existing literature predominantly focuses on satellite-based mapping solutions that rely heavily on optical imagery, which is significantly constrained by weather conditions. In contrast, studies utilizing dense and extended temporal series of SAR imagery for LC mapping remain relatively scarce. This work introduces a DL-based approach that leverages radar-derived information, incorporating both temporal and spatial features extracted from S1 imagery. These features are systematically organized into seasonal clusters rather than relying on continuous dense sequences. This methodology stands out by providing not only spatial information but also integrating temporal data through a novel seasonal division approach. This strategy markedly enhances LC classification performance, particularly in scenarios where Sentinel-2 (S2) tiles exhibit sparse temporal coverage. For instance, Ghassemi et al.[23] used a combination of features derived from S2 and S1 data. For S1, the features analyzed include temporal monthly and yearly statistical measures—such as the median, 5th, 50th, and 98th percentiles—calculated from the VH and VV polarimetric bands, as well as their combination (ratio and difference), and various indices ( $RVI$ , $NDPI$ , $DPSVIm$ ). Instead, in [24], temporal features are derived solely from S1 data, resulting in a spatially and temporally consistent time series using both VV and VH polarizations, as well as the cross-polarization ratio (CR), which is calculated as the ratio of VH to VV backscatter. The backscatter values ( $\sigma 0$ ) are averaged over 10-day periods for each pixel. This is done separately for ascending and descending acquisitions and for both VV and VH polarizations. The averaged backscatter values are then converted to decibels (dB), which standardizes the data for further processing. The CR, which is the ratio of VH to VV backscatter, is calculated for each scene and then averaged over the same 10-day periods. The study creates a gridded time series of consistent temporal features for VV, VH, and CR, used as input data for ML models. In contrast, the approach in the present work centers on spatial and textural feature extraction from S1 data, VH polarization. A set of spatial domain filters was applied to four seasonal composite images, also referred to as “super images.” Each of these super images represents a specific season (winter, spring, summer, and autumn) and is generated by calculating the arithmetic mean of S1 images taken during that season. This approach allows for the aggregation of temporal information into a single representative image per season, enabling more efficient spatial analysis while preserving seasonal variability. The “super image” provides a single, characteristic view, onto which spatial filters are applied to highlight spatial patterns and textural features of the radar backscatter. This technique offers a spatial representation that enhances texture and spatial details, rather than focusing on temporal variability. Moreover, regarding the calculation of the arithmetic mean, the number of S1 images varies from season to season, and this variability also depends on the area of interest. Due to the substantial volume of data in an annual time series over a large area of interest, a key advantage of the seasonal multitemporal approach is that there is no need to select a specific number of images based on strict spatial coverage criteria (e.g., 80 $\%$ , 90 $\%$ , or 100 $\%$ ). By averaging the temporal data, even images with partial spatial coverage—such as those capturing only a corner or an edge—can be included in the analysis.

In summary, the linked studies mentioned above prioritize temporal consistency and temporal backscatter trends through multiperiod averaging and cross-polarization, while the present work highlights spatial and textural details from a single seasonal composite, aiming at improving the spatial representation and at enhancing textural features in VH backscatter data. By synthesizing features from available seasonal data, the used system adeptly captures the essential spatial characteristics necessary for accurate LC mapping, while effectively managing the variability introduced by less frequent imaging. Hence, the used network proves to be effective for global LC classifications, achieving robust results with a technique that efficiently summarizes spatial characteristics from sparser temporal data, tailored to suit the climatic characteristics of different regions given the 10 m resolution employed.

In this work, a novel approach combining a transformer-based Swin-Unet architecture with seasonal synthesized spatio-temporal images has been employed to classify LC types using spatio-temporal features extracted from S1 SAR data, organized into seasonal clusters.

The workflow of the Swin-Unet architecture is shown in Fig. 1, and it was first introduced by [25] for medical image segmentation tasks. Unlike traditional CNNs, the Swin-Unet architecture overcomes limitations in learning global and long-range semantic information by incorporating a patch partition layer and a Transformer-based U-shaped encoder-decoder architecture with skip connections. This allows for both local and global semantic feature learning. The encoder utilizes a hierarchical structure with a shifted window self-attention mechanism to extract context features, while a symmetric decoder with a patch-expanding layer performs the upsampling operation, restoring the spatial resolution of the feature maps. Recent advancements in computer vision, such as the integration of attention modules such as CBAM in YOLOv8 architectures, have demonstrated the effectiveness of refining spatial features to enhance model performance in complex environments [26]. While these techniques are applied to object detection, our approach leverages the Swin-UNet model, which inherently incorporates hierarchical global attention through a shifted window mechanism, enabling precise spatio-temporal feature extraction for LC classification. This approach leads to superior results compared to those obtained using 2-D and 3-D CNN-based architectures such as Attention U-Net, oktay2018attention, and 3D-FCN from [28]. Another advantage is that the transformer model has a significantly lower number of trainable parameters (25 million, compared to approximately 31 million and 26.5 million, respectively), leading to reduced computational costs, decreased memory requirements, and mitigated risk of overfitting.

Fig. 1.

Overview of the employed Swin-UNet architecture described in Section II-D for the LC classification task. Distinctive aspects of the Swin Transformer architecture are combined with the UNet architecture to achieve optimal performance.

Show All

Compared to other works, this study conducts LC mapping at a global level by analyzing three distinct regions of interest: Africa, Amazonia, and Siberia.

Unlike many studies that focus on specific and spatially limited regions, this work extends the analysis across diverse climatic zones and further refines the study areas into ecoregions, allowing for a detailed assessment of classification accuracy across different environmental conditions. Moreover, while existing approaches frequently rely on multisensor data (combining radar and optical sources) this study demonstrates that solely using S1 SAR data is sufficient to achieve remarkably high accuracy across all three test areas.

Another key distinction is that, whereas some methodologies depend on very high-resolution radar data, this research effectively leverages S1 imagery at 10 m resolution, yet still achieves optimal classification results. Furthermore, instead of employing dense, continuous temporal sequences as commonly done, this study introduces a novel feature organization strategy based on seasonal clustering. This innovative approach not only enhances LC classification performance but also provides a robust solution for scenarios with limited temporal coverage, offering a more structured and integrative use of both spatial and seasonal temporal information.

Moreover, it is worth to highlight that this work was developed in the framework of ESA CCI+ HR LC project, Phase 2. This project focuses on understanding how LC and LC changes affect climate, aiming to improve climate modeling by examining the impact of spatial resolution on climate data. The LC analysis is crucial for measuring surface energy, water fluxes, greenhouse gas sources, and monitoring land use changes and extreme weather events. The project supports the global climate observing system's essential climate variables (ECVs) to aid the United Nations (UN) Framework Convention on Climate Change. It aims to explore how varying spatial and temporal resolutions influence LC classification and to develop methods for generating and updating ECV products for long-term climate research.

The rest of this article is organized as follows. Section II delves into the data preprocessing steps and neural network data preparation, detailing the extraction and computation of features of interest and the generation of the training set. Section III examines the data sources and the three chosen macro areas for testing. Section IV presents the outcomes across these selected study areas, including an assessment of performance within different ecoregion boundaries. Section V outlines discussion. Finally, Section VI concludes this article and provides potential future developments of this work.

A. Key Contributions

To recap, the key contributions of this article are as follows.

Development and utilization of synthesized spatio-temporal features derived from S1 SAR time series, organized into seasonal clusters. This approach effectively captures essential spatial and temporal characteristics without relying on dense temporal sequences, offering computational efficiency.
Enhanced LC classification performance through the use of seasonal clustering of weather-independent S1 SAR data, particularly in areas where optical data are significantly impacted by cloud coverage.
The innovative integration of the Swin-Unet model, which leverages hierarchical global attention to improve feature extraction and spatial representation.
Application of a global-level HR LC mapping approach across diverse regions (e.g., Africa, Amazonia, Siberia) and their respective ecoregions. This method achieves high classification accuracy using 10 m resolution S1 SAR data, even in regions with limited training data.
Extensive validation across distinct ecoregions, showcasing the model's generalization capability and robust classification accuracy in diverse environmental contexts.
Advancement of research within the framework of the ESA CCI+ HR LC project, Phase 2. The work emphasizes the significance of LC changes for climate studies and contributes to the generation of essential climate variables (ECVs) for long-term climate monitoring.

SECTION II.

Proposed Approach

The proposed approach is based on SAR S1 sequence classification using a DL architecture. Specifically, a SAR sequence is subdivided into seasonal subsequences, and a set of spatial features are computed from these reduced time series. To these spatio-temporal features the aforementioned Swin-Unet is then applied. The overall block diagram of the described procedure is shown in Fig. 2.

Fig. 2.

Simplified workflow diagram of the proposed mapping procedure applied to SAR temporal sequences. The preprocessing part was done using a SNAP graph, as explained in Section II-A. The multitemporal speckle noise reducer, feature extraction, training, and validation set generation are also described in Section II, while the DL-based block and the results are discussed in Section IV.

Show All

The proposed methodology is applied to spatial subsets (tiles) according to the tiling system used by ESA for S2 [29]. This allows for a uniform coverage of geographically large areas, and potentially the entire Earth's surface. Indeed, in this article the approach is applied to S1 SAR datasets in test areas covering Amazonia, Africa, and Siberia. These three areas were selected to assess the robustness of the methodology in regions with significantly different environmental and climatic characteristics, as discussed in more details in Section III-B.

A. SAR S1 Data PreProcessing

The initial step involves preprocessing the radar sequences, similar to the method employed by the European Space Agency's (ESA) Sentinel application platform. [30]. Each considered S1 sequence is assumed as composed by multiple images taken on the same orbit with the same beam and polarization. These images are preprocessed according to a set of correction and refinement steps [31] and shown in Fig. 3.

Fig. 3.

Block diagram of S1 data preprocessing.

Show All

The chain in Fig. 3, consists of a number of steps applied to the Level-1 S1 data, as follows:

Orbit file application: To correct the satellite position and add velocity information, in order to achieving a precisely geocoded observations.
Thermal noise removal: Thermal noise is an additive interference component superimposed on the signal of interest and is processed with the same processing gains applied to the true signal. Thermal noise removal reduces the effects of the thermal distortion in the intersubswath texture and performs the normalization of the backscattered signal within the entire acquisition.
Border noise removal: The border noise removal algorithm [32] removes the low intensity noise caused by radiometric artefacts and invalid data at the edges of the image.
Radiometric calibration: Radiometric calibration is necessary because the gray level of SAR imagery must be adjusted taking into account the backscatter signals from objects in the area. The digital number of the pixel is converted to a radiometrically calibrated backscatter value.
Geometric terrain correction: Terrain correction eliminates the distortion caused by the topographical variations [33], thereby increasing the accuracy of the location of the objects in the scene. The correction is performed by using the digital elevation model data to extract the height information of the object. The data is then resampled using nearest-neighbor interpolation at a pixel spacing of 10 m spatial resolution and geolocated in the WGS84 coordinate system.

B. Feature Extraction

Once the sequence has been preprocessed and somehow “harmonized,” spatio-temporal SAR features are extracted and used as inputs to the DL architecture. Specifically, following [34], it is assumed that temporally aggregated portions of a full annual time series are more appropriate for LC mapping. In this approach, the original SAR sequence is subdivided into four “seasonal” clusters that approximate the seasonal cycle of the different LC types. These identified clusters suffer from speckle distortion due to the nature of SAR acquisition [35], which affects the appearance of the image and the performance of the scene analysis. Speckle in SAR is a multiplicative effect (i.e., it is directly proportional to the gray value of the pixel) and is mitigated using a multitemporal denoising filter, as described below, to preserve radiometric and textural information.

1) Multitemporal Despeckle Filtering

A suitable and advanced multitemporal denoising filter based on the one described in [36] is applied to the four “seasonal” clusters. The multitemporal approach appears to provide better results than a spatial filter applied independently to each SAR image, thanks to the exploitation of the temporal sequence, to the benefit of a better spatial resolution preservation. The filter is ratio-based and computes an image, called super image, by exploiting the SAR time series. In fact, the temporal averaging of SAR time series produces the super image $\hat{\mathbf {u}}_{m}$ , according to the formula below:

$\begin{equation*} \hat{\mathbf {u}}_{m}=\frac{1}{S}\sum _{s=1}^{S}{\mathbf {x}_{s}} \, \end{equation*}$ View Source

where

$S$

is the images's number of the seasonal sequence,

$\mathbf {x}_{s}$

is the SAR image of the seasonal sequence, and

$s$

is the time index.

Speckle is reduced and spatial resolution is preserved. The filtered image is thus recovered by exploiting the statistical properties associated with the original super image. Basically, the method comprises three steps: a) computation of the seasonal super image $\hat{\mathbf {u}}_{m}$ by the arithmetic averaging of time series images; b) denoising of the ratio image; c) calculation of the final images given by the multiplication between the denoised ratio and the super image. These steps and more details are well reported in the workflow shown in Fig. 4.

$Fig. 4. - Multitemporal despeckle flowchart applied to the S1 temporal sequence. The temporal averaging of the SAR time series produces the super image $\hat{u}_{m}$. The super image is used to form the ratio image $\tau _{t}$, given by the image $\upsilon _{t}$ at time $t$ and the super image $\hat{u}_{m}$, for each pixel of the S1 images. The Lee filter is then applied to $\tau _{t}$ because the super image $\hat{u}_{m}$ suffers from speckle (although the speckle in the super image is greatly reduced), resulting in the image $\hat{\rho }_{m}$. In the last step, the restored image $\hat{u}_{t}$ is obtained by multiplying the denoised ratio image with the super image.$

Fig. 4.

Multitemporal despeckle flowchart applied to the S1 temporal sequence. The temporal averaging of the SAR time series produces the super image $\hat{u}_{m}$ . The super image is used to form the ratio image $\tau _{t}$ , given by the image $\upsilon _{t}$ at time $t$ and the super image $\hat{u}_{m}$ , for each pixel of the S1 images. The Lee filter is then applied to $\tau _{t}$ because the super image $\hat{u}_{m}$ suffers from speckle (although the speckle in the super image is greatly reduced), resulting in the image $\hat{\rho }_{m}$ . In the last step, the restored image $\hat{u}_{t}$ is obtained by multiplying the denoised ratio image with the super image.

Show All

2) Spatio-Temporal Feature Computation

After the data preprocessing described in Section II-A, a spatio-temporal feature extraction is performed using the polarimetric information derived from the intensities of the VH (cross-polarized) S1 polarization. Rather than considering complex spatial features such as shape and size, which would require unsupervised segmentation of the image, a set of textural feature set is computed for each of the following four seasonal super image $\hat{\mathbf {u}}_{m}$ :

Lee filter, an adaptive filter recognized as the first model-based approach specifically designed for reducing speckle noise [31]. It maintains significant edges, linear structures, point targets, and texture details by minimizing the mean square error.
$\begin{equation*} u_{\text{LEE}}(i,j)=\bar{u}_{m}(i,j)+K \cdot (\hat{u}_{m}(i,j)-\bar{u}_{m}) \, \end{equation*}$ View Sourcewhere $u_{\mathrm{{LEE}}}(i,j)$ is the Lee-filtered pixel value at position $(i,j)$ , $\bar{u}_{m}$ is the local mean of the pixels within a specified kernel $K$ around the pixel $(i,j)$ , $\hat{u}_{m}(i,j)$ is the seasonal super image at location $(i,j)$ , and $K$ is the weighted factor given by the noise variance and the local variance within the window.
Median filter, is a nonadaptive filter that substitutes each pixel's value with the median of the values from the surrounding local neighborhood as follows:
$\begin{equation*} u_{\text{MEDIAN}}(i,j)=\text{median} \lbrace \hat{u}_{m}(i^{\prime },j^{\prime }) \ | \ (i^{\prime },j^{\prime }) \in \mathcal{N}(i,j)\rbrace \, \end{equation*}$ View Sourcewhere $\mathcal{N}(i,j)$ is the neighborhood defined by the kernel centered at pixel $(i,j)$ , the set $\lbrace \hat{u}_{m}(i^{\prime },j^{\prime })\rbrace$ contains the pixel values within the kernel surrounding the pixel $(i,j)$ . The median function selects the middle value from the sorted pixel values in the neighborhood.
Mean filter, is among the most commonly utilized low-pass filters (LPF), replacing the value of the pixel with the average of all the values within the surrounding local neighborhood (filter kernel) as follows:
$\begin{equation*} u_{\text{MEAN}}(i,j)=\frac{1}{N}\sum _{(i^{\prime },k^{\prime })\in \mathcal{N}}\hat{u}_{m}(i^{\prime },k^{\prime }) \, \end{equation*}$ View Sourcewhere $u_{\text{MEAN}}(i,j)$ is the mean-filtered image at pixel $(i,j)$ , $\mathcal{N}$ is the neighborhood defined by the kernel size, $N$ is the total number of pixels in the neighborhood $\mathcal{N}$ , and $(i^{\prime },k^{\prime })$ are the coordinates of the pixels in the neighborhood around $(i,k)$ .
Maximum (or minimum) filter, nonlinear filter identifies the brightest (or darkest, for the minimum) point in an image. It is based on the median filter, as it corresponds to the $100\text{th}$ (or $0\text{th}$ , for the minimum) percentile, meaning it selects the maximum (or minimum) value of all the pixels within a specified local region of the image.
$\begin{align*} u_{\text{MAX}}(i,j)=\max _{(i^{\prime },j^{\prime })\in \mathcal{N}} \hat{u}_{m}(i^{\prime },j^{\prime }) \ & \\ u_{\text{MIN}}(i,j)=\min _{(i^{\prime },j^{\prime })\in \mathcal{N}} \hat{u}_{m}(i^{\prime },j^{\prime }) \ &. \end{align*}$ View Source
Range (Max–Min) filter, enhances image contrast by calculating the difference between the dilation and erosion (maximum and minimum) of the original image. For the S1 seasonal super image $\hat{\mathbf {u}}_{m}$ , the filtered output $\mathbf {u_{\mathrm{Max-Min}}}$ is given as follows:
$\begin{equation*} \mathbf {u_{\mathrm{MAX-MIN}}}=\mathbf {u_{\text{MAX}}}-\mathbf {u_{\text{MIN}}} \, \end{equation*}$ View Sourcewhere $\mathbf {u_{\text{MAX}}}$ and $\mathbf {u_{\text{MIN}}}$ are the output of applying maximum (dilation) and minimum (erosion) filters to the input super image $\hat{\mathbf {u}}_{m}$ , respectively, as defined above.

A kernel of $5 \times 5$ pixels is used to derive each of these spatial statistical descriptors.

In summary, after the SAR sequence is properly preprocessed, four subsequences are selected and the corresponsing super images super are computed. Finally, for each superimage seven spatial features are extracted, resulting in 28 features to be used for the classification.

C. Network Preprocessing

Before being fed into the DL archirecture, to the aforementioned features a min–max normalization is applied, to ensure that all data falls within the same scale range. This is performed because unscaled input variables can result in a slow or unstable learning process [37].

In addition, data augmentation techniques are employed since, as expressed in [38], DL architectures (especially ViTs) require large training datasets to maximize their classification accuracy. Specifically, data augmentation is implemented by dividing both the input and reference images into smaller patches, each measuring 256 pixels, with an overlap determined by a stride of 128 pixels. This approach enhances dataset diversity and increases the number of training samples. By generating smaller patches, the method improves dataset coverage, which is particularly beneficial when only a limited number of tiles are available, such as in the context of ecoregion analysis (see Section IV-B).

Increasing the dataset size also helps in preventing overfitting issues and in reducing memory usage. Utilizing smaller patches instead of processing entire tiles with the original size (512 × 512 pixels) minimizes memory requirements and computational burden, thereby improving the efficiency of the training process.

Following the aforementioned patching process, the number of samples is increased significantly across all regions. In Siberia, the initial dataset of 64 tiles is expanded to 576 patches. Similarly, Africa grows from 103 tiles to 927 patches, while Amazonia increases from 86 tiles to 774 patches. This increase in sample availability strengthens model training by providing a more comprehensive representation of spatial patterns, improving generalization, and enhancing robustness across different regions.

D. Swin-Unet

As introduced in Section I, the best performing DL model was used for the classification task at hand is the Swin-Unet,cao2021SwinUnet. This architecture consists of three main blocks: the encoder, the bottleneck, and the decoder.

The encoder utilizes multiple Swin Transformer layers [39], designed to process the input images hierarchically through a series of stages. Each stage implements the shifted window self-attention,yu2022selfattention. This mechanism allows the model to efficiently capture local interactions in the early stages and progressively build up to understanding broader areas of the input image. This effectively reduces the resolution of feature maps while increasing the feature dimensions, enabling a deep and rich understanding of the input data.

The bottleneck serves as the critical transition point between the encoder and decoder modules. It is typically composed of one or more Swin Transformer layers situated at the deepest part of the network. This middle part of the network focuses on integrating and compressing the high-level features learned by the encoder.

Finally, the decoder focuses on progressively expanding the encoded features back to the original image resolution. It employs Swin Transformer layers arranged in stages, each one using a patch expanding layer to progressively increase the spatial resolution of the feature maps. In addition, skip connections from corresponding encoder stages are integrated at each level of the decoder. These connections help restore spatial details that are often lost during downsampling in the encoder. A simple diagram of the overall model is shown in Fig. 1.

E. Swin-Unet Model Configuration

The Swin-Unet model configuration was carefully designed to balance computational efficiency and model performance. Among the selected parameters are feature size, batch size, and learning rate.

The feature size parameter determines the dimensionality of the initial feature maps and directly impacts the model's complexity. For this study, the value was set to 48, resulting in approximately 25 million trainable parameters, as shown in Table VI under the Complexity column. A more detailed discussion of model complexity, including additional parameters influencing computational demands, is presented in Section V. This choice represents a balance between model capacity and computational overhead. Although higher values such as 60 or 72 could enhance the model's representational capability, they also introduce significantly higher memory requirements and computational costs. Testing with values greater than 48, such as 60 and 72, did not result in significant accuracy improvements, but exponentially increased the model's complexity, reaching approximately 39.2 and 56.5, respectively.

TABLE I Class Legend of MOLCA Dataset With the Number of Samples per Region (Train and Test Set)*

TABLE II OA, Kappa Coefficient, F1-Score, and PA for the Considered Models in Different Regions

TABLE III Seasonal Count of S1 Images From the 2021 Time Series for Amazonia, Africa, and Siberia

TABLE IV Results for Standard Input Time Series for Amazonia, Africa, and Siberia

TABLE V OA by Swin-Unet Model Across Random States for a Robust Cross-Validation With a Consistent 70% -30% Train-Test Split for Amazonia, Africa, and Siberia

TABLE VI Comparison of FPS at Varying BS and Model Complexity (Millions of Trainable Parameters) for the Three DL Models Tested in This Work

The batch size for model training was fixed at 1 due to the high-dimensional nature of the input data, which consists of 28-channel spatio-temporal features. This choice was influenced by memory constraints, as larger batch sizes would exceed the available GPU memory.

The learning rate was set to $10^{-4}$ , ensuring stable and efficient optimization, achieving consistent improvements on the validation set. Training was carried out for a total of 30 epochs, following the parameter hyperparameter configuration detailed in [28]. Specifically, the model utilized the Adam optimizer and Categorical Cross-Entropy loss function for LC classification optimization. These settings allowed the model to handle the 28-feature temporal sequence efficiently and ensured stable performance during training.

By setting feature size to 48, batch size to 1, and learning rate to $10^{-4}$ , the Swin-Unet model achieved an optimal balance between computational feasibility and predictive accuracy. The total complexity of the model, measured as the number of trainable parameters, ensured the capacity to efficiently handle the LC classification task while maintaining computational efficiency. This configuration was validated through experimental runs across various dataset separations, as shown in Table V, demonstrating stable convergence and robust accuracy on the test set, regardless of the data split.

SECTION III.

Input Data and Test Areas

As mentioned earlier, in this work the general procedure described in the previous section has been applied to S1 datasets in specific geographical locations.

A. S1 Mission and Data

The S1 mission is the radar component of the European Copernicus programme, which has many operational applications in addition to the LC mapping. It is composed by a constellation of two satellites, S1A and -1B, with a C-band synthetic aperture radar sensor. Each satellite is equipped with right-looking antennas with an angle of incidence between 29.1 $^\circ$ and 46 $^\circ$ and has a revisit time of approximately 6 days over land (equatorial in a polar orbit).

The data are also available free of charge to the public via the Copernicus Open Access Hub,¹ making it more attractive for new challenging applications and opportunities [41].

For this work, level-1 ground range detected (GRD) products acquired in interferometric wide swath (IW) mode were used. In these products the phase information is lost due the application of the multilooking filter and the ground range projection based on an Earth ellipsoid model. The datasets are in HR and provide images with a native range by azimuth resolution $20 \text{ m}\times 22$ m and pixel resolution equals to $10\text{ m} \times 10$ m.

In this study, the VH polarization has been selected, as outlined in [7] and [31], due to its heightened sensitivity to vegetation, water and urban characteristics [42], [43], [44]. Furthermore, to strengthen this choice, tests were conducted using the VH polarization alone, as well as in combination with the VV polarization. The comparative results of these tests are documented in the product validation and algorithm selection report (PVASR) for the ESA CCI+ HR LC project, Phase 1, which is freely accessible on the dedicated website under the “Key Documents” section. The findings indicated that the inclusion of the VV polarization did not provide a consistent improvement in LC classification, thereby reinforcing the preference for the VH band.

B. Test Areas

Three test macro areas in Amazonia, Africa, and Siberia (see Fig. 5) have been identified and used for the experiments, according to the guidelines of the ESA “Climate Change Initiative Extension (CCI+) Phase 2: New Essential Climate Variables (NEW ECVS)” project.

$Fig. 5. - Enlargements of the test areas: (a) Amazonia (62.1014$^\circ$ W, 23.5983$^\circ$ S : 42.9441$^\circ$ W, 0$^\circ$ N, WGS 84), (b) Africa (9.8986$^\circ$ E, 0.0885$^\circ$ S : 43.2908$^\circ$ E, 18.0891$^\circ$ N, WGS 84), and (c) Siberia (64.4361$^\circ$ E, 51.2789$^\circ$ N : 93.4017$^\circ$ E, 75.6847$^\circ$ N, WGS 84).$

Fig. 5.

Enlargements of the test areas: (a) Amazonia (62.1014 $^\circ$ W, 23.5983 $^\circ$ S : 42.9441 $^\circ$ W, 0 $^\circ$ N, WGS 84), (b) Africa (9.8986 $^\circ$ E, 0.0885 $^\circ$ S : 43.2908 $^\circ$ E, 18.0891 $^\circ$ N, WGS 84), and (c) Siberia (64.4361 $^\circ$ E, 51.2789 $^\circ$ N : 93.4017 $^\circ$ E, 75.6847 $^\circ$ N, WGS 84).

Show All

These areas correspond to very different LC and climate typologies, and the selected portions have a wide extension each (5.370 billion km $^{2}$ in Amazonia, 7.340 billion km $^{2}$ in Africa, and 3.787 billion km $^{2}$ in Siberia).

Amazonia is a region dominated by vegetation and hot tropical weather, a perfect example of the utility of SAR data: indeed, it is very difficult to obtain cloud-free optical images over this area due to the harsh climate, with precipitation ranging from 200 mm to 320 mm per month and an average humidity of 89% .
Africa is a very complex, climate-sensitive region with a history of severe weather events, many of which are linked to global warming. Morphologically, the region is characterized by bare soil, lakes, and arable land, and many zones have experienced severe droughts worth a deeper investigation.
Finally, Siberia has a very cold climate year round, making it a potential hotspot for future climate change research. The region is characterized by many rivers and water bodies covered with ice and snow for about 75% of the year. Here, the advantage of multitemporal SAR is clear: SAR signals can penetrate clouds and rain, ensuring periodic data acquisitions.

C. Training Set Generation

To build the training set for the DL architecture in these areas, the map of LC agreement (MOLCA) [45] has been used. MOLCA was generated using already existing global HR LC maps, retaining only those areas where all datasets agree on the same LC class and discarding areas of disagreement (these pixels are identified as nodata and are set to zero on the map). The MOLCA images, arranged according to the S2 Level-1C product tiling grid and distributed in GTiff format, cover the above mentioned three regions in Amazonia, Africa, and Siberia, with about 117 billion pixels at 10 m resolution.

The MOLCA dataset was produced as part of the ESA - funded Climate Change Initiative Extension (CCI+) Phase 1 New Essential Climate Variables (NEW ECVS) HR LC ECV (HR_LandCover_cci) project, known as CCI HRLC or CCI+ HRLC.² The LC classes represented in MOLCA are shown in Table I, and cover the period from 2016 to 2020. The Table I also includes the number of samples (pixels) used for the train and test sets, which were employed in subsequent evaluations with DL models for each analyzed region: Africa, Amazonia, and Siberia. Notably, no representative samples (pixels) for the classes “permanent ice and snow” and “lichens and mosses” are present in the three areas of interest. The accuracy estimate for MOLCA indicates an OA of 96% [45].

As highlighted in Fig. 6, No data values account for over 50 $\%$ of the entire dataset in the study areas. For this reason, prior to the training phase, all the pixels in the input SAR sequences that correspond to this class should be set to zero. This approach helps prevent the model from learning erroneous relationships for the No data class, which is useless and misleading.

Fig. 6.

Distribution of classes for the three areas of interest.

Show All

Subsequently, to select a significant training data set, the areas in Fig. 5 were randomly and homogeneously sampled in accordance with the S2 tiling, the spatial coverage of which is shown in Figs. 7–9 for Amazonia, Africa, and Siberia, respectively. Each tile with a size of $10980 \times 10980$ pixels in the UTM coordinate reference system has been patched in smaller areas of $549 \times 549$ pixels, i.e., 1/20th of the tile linear dimensions. The most significant patch, i.e., with the largest number of LC classes is then selected by means of visual inspection for each tile and area in order to obtain a balanced representation of the LC classes present in the scenes, the distribution of which is shown in Fig. 6.

Fig. 7.

Seasonal distribution of S1 acquisitions in 2021 for the S2 tiles selected to collect the training set over the Amazonia site. The distribution is shown for: (a) Winter, (b) spring, (c) summer, and (d) autumn seasons.

Show All

Fig. 8.

Seasonal distribution of S1 acquisitions in 2021 for the S2 tiles selected to collect the training set over the Africa site. The distribution is shown for: (a) winter, (b) spring, (c) summer, and (d) autumn seasons.

Show All

Fig. 9.

Seasonal distribution of S1 acquisitions in 2021 for the S2 tiles selected to collect the training set over the Siberia site. The distribution is shown for: (a) winter, (b) spring, (c) summer, and (d) autumn seasons.

Show All

Once the most representative patches have been identified, the corresponding S1 features are computed according to the methodology presented in the previous section. The seasonal spatial distributions with respect to the availability of the S1 acquisitions are shown in Figs. 7–9, for Amazonia, Africa, and Siberia, respectively. In the graphs, the adopted colormap represents varying levels of data availability, ranging from areas with only 5–13 images (indicated in red) to regions with more than 50 acquisitions (represented in dark green). The gradient between these colors highlights intermediate values, allowing for a visual understanding of data distribution. In addition, the macro areas of interest are marked with blue rectangles. Despite the presence of red tiles in each season, the number of acquisitions is sufficient to carry out the spatio-temporal feature extraction [31]. The final training sets consist of 86 MOLCA patches and 2408 S1 features for the Amazonian area; 103 MOLCA patches and 2884 S1 features for Africa, and 64 MOLCA patches and 1792 S1 features for Siberia.

SECTION IV.

Results

For each case study, all S1 acquisitions in 2021 were considered, and four subsequences identified according to the seasons. This selection resulted into a collection of 5105 SAR images for Amazonia (1264 winter, 1310 spring, 1338 summer, and 1193 autumn data sets), 5.827 SAR images for Africa (1470 winter, 1480 spring, 1654 summer, and 1405 autumn data sets), and 3396 SAR images for Siberia (768 winter, 744 spring, 984 summer, and 900 autumn data sets). All the images are GRD, IW, VH, descending orbit data sets.

As mentioned in Section II-E, for model training, the channel axis of the input tensors to the networks was used to store the spatio-temporal information contained within the sequence of seasonal features extracted from the original SAR images. By doing so, the network accepts an input of $B \times T \times W \times H$ size, where $B$ is the batch size, $T$ is the temporal dimension, while $W$ and $H$ represent the width and the height of the images, respectively.

A. Results Across Different Study Areas

The results for the three examined regions—Amazonia, Africa, and Siberia—are presented in Table II, where the outcomes obtained using the best-performing model (Swin-Unet) are compared with two reference CNN-based models, 3-D-FCN [28] and Attention-Unet [27]. A visual comparison of the results obtained with the Swin-Unet model, including random S1 inputs from the validation set, ground truth, and corresponding predictions, is depicted in Figs. 10, 11, and 12. In these areas, the Swin-Unet model has achieved OAs of 93.3 $\%$ , 93.6 $\%$ , and 97.4 $\%$ , respectively. The associated confusion matrices are illustrated in Figs. 13, 14, and 15, where each element along the diagonal represents the producer's accuracy (PA) of the respective class. The PA associated with the No data class has been omitted from the matrix representations since the masking operation explained in Section II-C led to a 100% PA for this specific class. It is only reported for completeness in the aforementioned Table II.

Fig. 10.

Visual comparison between input S1 (first row), ground truth (second row), and predicted patches (third) for the Amazonia region.

Show All

Fig. 11.

Visual comparison between input S1 (first row), ground truth (second row), and predicted patches (third row) for the African region.

Show All

Fig. 12.

Visual comparison between input S1 (first row), ground truth (second row), and predicted patches (third row) for the Siberian region.

Show All

Fig. 13.

Confusion matrix based on the validation set for the Amazonia region.

Show All

Fig. 14.

Confusion matrix based on the validation set for the African region.

Show All

Fig. 15.

Confusion matrix based on the validation set for the Siberian region.

Show All

Notably from Table II Forest, Grassland, Cropland, and Water classes are very well extracted using the best model in all three cases of the study areas.

The comparison of the utilized models on the African dataset reveals that the CNN-based methods may struggle to distinguish Bareland from classes such as Water or Grassland in SAR images. The histogram based on the normalized mean values of the stacked seasonal features for each input pixel in the African dataset (see Fig. 16) shows that the Forest class is more clearly separated, while there is overlap among the Bareland, Water, and Grassland classes.

Fig. 16.

Histogram distribution of the normalized mean values of the stacked seasonal features for each input pixel in the African dataset, corresponding to different LC classes: Forest, Bareland, Water, and Grassland, showing the overlap among these classes.

Show All

In Africa, the 3-D-FCN gives 0.77 for Water, but 0.36 for Bareland, 0.53 for Cropland, and 0.44 for Grassland. The Attention U-Net framework instead achieved PA of 0.89 for Water, 0.66 for Forest, and 0.55 for Grassland, but only 0.02 for Bareland. In contrast, Swin-Unet achieves a final PA of 0.97 for Forest, 0.83 for Grassland, and 0.93 for Bareland, outperforming the other two models.

The enhancement in Bareland recognition is evident not only in Africa but also in the Siberian dataset. Despite comprising only 1.04% of the available tiles, Bareland achieves a PA of 0.75 through the Swin-Unet model, a significant improvement over the initial 0.16 and 0.06 values obtained with the 3-D-FCN and Attention U-Net, respectively. These results still indicate that the transformer model recognizes various LC classes and tends to make more balanced decisions compared to the other two CNN-based models, which, conversely, demonstrate high PA only for specific classes such as Water. This discrepancy may arise from the Attention U-Net model's limited generalization capability, leading to good classification performance only for classes with easily recognizable spatial patterns or unique brightness values, such as Water or Forest [see Fig. 17(a), (b), and (c)]. However, this results in poorer performance when distinguishing the more complex morphological relationships of specific LC classes such as Bareland [see Fig. 17(d), (e), and (f)]. Therefore, Attention U-Net, relying on locality-dependent attention mechanisms, may struggle with classes sharing similar pixel value distributions.

Fig. 17.

Comparison between the spatial patterns of Forest and Water (a, b, c) when compared to the more complex spatial pattern of the patches containing the Bareland class (d, e, f).

Show All

Regarding the 3-D-FCN model, its 3-D logic-based structure does not offer additional advantages when the input data is not a dense temporal sequence but instead consists of seasonal synthetic images, the so-called features. In this context, a 2-D-CNN model, such as the Attention U-Net, appears to be sufficient. In addition, the process of masking No data values, as previously described in Section III-C, hampers the learning of continuous spatio-temporal relationships, potentially resulting in a loss of contextual information. In contrast, Transformer-based models such as Swin-Unet consider global pixel relationships, leading to a more accurate assessment of context and environment, as explained in Section I. The global attention mechanisms in these models can help recover this context, which CNNs cannot achieve due to the local nature of the convolutional kernel. In addition, the Built-up class showed significant improvement through the Swin-Unet model in Amazonia and Siberia with final values of 0.85 and 0.92.

For the African region, the value for this specific class remains poor, likely due to the lower number of representative labels for the aforementioned class in this area (only 0.5 $\%$ ).

Another contributing factor could be the small and fragmented nature of urban areas, particularly when compared to more spatially uniform LC classes such as Forest, Grassland, or Cropland. In favor of these latter classes, the highest confusion is observed when predicting Built-up, indicating a significant challenge in visually distinguishing small urban areas from surrounding classes due to their reduced size and scattered appearance.

For Wetland, a high PA of 0.79 is obtained in the Siberian region while, for the other two areas, the accuracy remains low. In both cases, this class is often misclassified with Grassland, but this is likely due to the fact that samples from this LC class are represented by only 0.8% and 0.6% of the Amazonian and African datasets, respectively.

Finally, for the Shrubland class, the best-performing model demonstrates significant improvements in PA, achieving final values of 0.50 (Amazonia) and 0.52 (Africa). In contrast, the Attention U-Net model consistently yields zero accuracy scores for this LC class across all cases, while the 3-D-FCN struggles with final values of 0.08 and 0.27, respectively. However, in Siberia, PA remains zero for all three considered DL models, as this class class represents only 0.006% of the region.

It is worth noting that the obtained PA values for the Shrubland class in Africa and Amazonia may appear good but not outstanding when analyzed individually. This is due to the specific scenarios in which this class is situated. For instance, as shown in Fig. 11(b), classification benefits from a noticeable contrast in backscattering values between Shrubland and adjacent classes within the S1 image. Conversely, scenarios such as those depicted in Fig. 11(f) may exhibit slightly higher confusion due to the spatial contiguity between the target class and neighboring LC types, such as Grassland. As observed in Figs. 13 and 14, the model tends to confuse Shrubland with Grassland in 36% and 13% of cases, respectively.

The classification performance of three DL architectures was assessed against one of the most widely used nonparameteric supervised classifiers, the RF algorithm [46], known for its application in LC mapping utilizing both optical and SAR data [47], [48], [49]. An analysis of the results presented in Table II reveals that, for both the Amazon and Siberia, the RF model exhibits lower final OAs compared to all three DL approaches. In Africa, the RF model achieves a final OA of 0.745, which is marginally higher than the OA of both the 3-D-FCN and Attention U-Net models. However, the RF underperforms overall, as reflected by its Kappa coefficient of 0.544 and F1-Score of 0.712, both of which are slightly lower than those of the Attention U-Net and 3-D-FCN models. Moreover, the RF model is notably outperformed by the Swin U-Net architecture, which achieves significantly higher metrics, with a final OA of 0.936, a Kappa of 0.900, and an F1-Score of 0.932. The classification trend aligns with previous findings from the DL models, demonstrating moderate performance in recognizing the Forest and Water classes. Specifically, in the Amazon, the final recorded PA is 0.51 for the Forest class and 0.18 for the Water class. In Africa, the PAs reach 0.64 for both the Forest and Water classes, while in Siberia the PAs are 0.31 for Forest and 0.41 for Water. For other classes, the RF model exhibits significant struggles, particularly with the Shrubland (PA = 0.09), Cropland (PA=0.07), Bareland (PA = 0), and Built-up (PA = 0.02) categories in Amazonia, as well as Shrubland (PA = 0.11), Wetland (PA = 0.07), and Built-up (PA=0) in Africa. Similar challenges are evident in Siberia, where PAs for Shubland (PA=0), Wetland (PA=0.06), and Bareland (PA=0) areas are also low. These PA values indicate a marked inability to identify these classes, suggesting that the model may be inadequately trained or that these classes suffer from high intraclass variability, complicating accurate classification. This analysis underscores the limitations of the RF algorithm in contexts characterized by complex LC types. Research indicates that although RF is effective for various applications, DL approaches generally outperform it when nuanced classification is required due to their capacity to learn hierarchical feature representations [50]. The effectiveness of DL models in managing complex datasets and capturing intricate patterns in LC classification is evident. In environments such as the Amazonia, DL architectures can leverage their ability to learn hierarchical features, resulting in improved classification accuracy. For instance, studies have demonstrated that the Swin U-Net, which employs a shifted window mechanism, excels in capturing both local and global contextual information, thereby significantly enhancing classification performance [51]. Moreover, while RF remains a robust option in various scenarios, its effectiveness can be compromised by the high dimensionality and complexity of remote sensing data, particularly in heterogeneous landscapes. This highlights the necessity of employing advanced techniques for LC mapping, where DL methods can better exploit the rich information inherent in the data.

B. Analysis by Ecoregions

Further experiments were carried out considering the ground truth organized according to ecoregions. Ecoregions are geographic regions of the world that indicate the distribution of ecosystems and plant and animal communities. They have been defined in different ways, depending on the specific purpose of particular regionalization approaches. They are then understood as macroecosystems, the largest regional-scale ecosystem units, corresponding to the large climatic regions where climatic conditions are relatively uniform [52], [53]. Fig. 18 depicts the MOLCA patches of the training set for Amazonia, Africa, and Siberia according to the climatological regions. The legend reports the colormap and the numerical code of each region.

Fig. 18.

Subdivision of the MOLCA patches of the training set according to the ecoregions for (a) Amazonia, (b) Africa, and (c) Siberia. The legend provides the information on the colormap and numerical values used to identify each climatological area.

Show All

In this section, the experiments were performed by considering these climatic subdivisions within the three considered study areas. This supplementary analysis is designed to assess the results of training and testing within the same ecoregion compared to different ones, aiming to enrich the depth of this evaluation. The results obtained using the Swin-Unet model are reported in Fig. 19(a), (b), and (c) and are expressed in terms of OA. Elements on the diagonal represent the cases of training and testing within the same ecoregion, while off-diagonal elements represent cross-validation between different ecoregions.

Fig. 19.

OA matrices obtained according to possible training/test ecoregion combinations for (a) Amazonia, (b) Africa, and (c) Siberia.

Show All

In the Amazonia region [see Fig. 19(a)] the OA is typically higher when both training and testing are conducted within the same ecoregion. However, an exception is noted when using Ecoregion 14 and Ecoregion 9 for training, coupled with testing in other ecoregions. The lower performance in this scenario can be attributed to Ecoregion 14 and Ecoregion 9 having only one and four tiles, respectively, which likely limits the diversity and volume of training data.

A similar pattern is observed in the African area [see Fig. 19(b)], where lower performance is particularly noticeable when using Ecoregion 9 and Ecoregion 10 for training. These two contain only 3 and 4 tiles, respectively, which may not provide sufficient data for the model to generalize effectively across more diverse ecoregions.

In the Siberian area [see Fig. 19(c)], the OA values remain impressively high, exceeding 80% in all possible combinations. It is noteworthy, however, that the lowest values occur when training in Ecoregion 4 and testing in other ecoregions. Despite this, the accuracy still remains consistent, even with Ecoregion 4 comprising only 4 tiles. This highlights the robust generalization capabilities of the best performing model in this particular area, demonstrating its effectiveness even with limited data from certain ecoregions.

C. Comparative Analysis With SAR Time Series

This section presents a comparative analysis using standard time series data for the three considered regions to demonstrate the effectiveness of the synthesized seasonal feature extraction approach. The objective is to show that this method yields superior results compared to acquiring multiple time-step images for each season. To ensure alignment with the synthesized seasonal approach, five separate temporal images were collected per season for each MOLCA tile, culminating in a total of twenty images per tile for the year 2021. This method provided a comprehensive and consistent dataset across regions, totaling 1620 S1 images for Amazonia, 1860 for Africa, and 960 for Siberia. This structured dataset ensures high temporal resolution, enhancing the consistency and reliability of the seasonal analysis for each MOLCA tile within these distinct ecological zones. Images were carefully selected to ensure complete coverage of each tile, encompassing 100 $\%$ of the tile area. To assure data consistency and spatial completeness, images failing to meet the required spatial coverage threshold, along with tiles lacking sufficient temporal coverage, were systematically excluded from the dataset. This selective process preserved the quality and uniformity of the dataset, retaining only tiles with robust temporal representation suitable for detailed analysis. As a result, the finalized dataset included 81 MOLCA tiles for Amazonia, down from the initial 86; 93 tiles for Africa, reduced from 103; and 48 tiles for Siberia, compared to an initial 64. The seasonal distribution of the acquired S1 time series is illustrated in Table III.

Notably, in Amazonia, Africa, and particularly in Siberia, the limited availability of multitemporal data leads to some information loss, as fewer tiles meet the requirement of five acquisitions per season.

This analysis identifies two following primary limitations:

Higher computational cost: Acquiring multiple images at different time steps increases data storage and processing requirements compared to synthesizing information from extracted seasonal features;
Limited data availability: In certain regions, limited access to consistent multitemporal data restricts comprehensive analysis capabilities.

Moreover, a key advantage of utilizing the synthesized seasonal feature extraction lies in the elimination of any need to impose conditions related to spatial coverage. This is achieved as the seasonal temporal dynamics are effectively condensed through the computation of the “super image,” which integrates seasonal variations into a single, comprehensive image. This approach simplifies the process, ensuring that temporal information is consistently represented without gaps or additional spatial constraints, thereby enhancing both the efficiency and coherence of the dataset.

The results of this comparative analysis, utilizing the best performing DL model (Swin-Unet) and the RF standard approach, are shown in Table IV. Both methods utilize the same training and validation sets, ensuring a fair comparison of their performance metrics. The analysis demonstrates a general decline in performance when using standard time-series images compared to the synthesized seasonal features approach, as indicated in Table II.

SECTION V.

Discussions

The study presented a comprehensive evaluation of three DL models—specifically the Swin-Unet, 3-D-FCN, and Attention-Unet—across various geographical regions: Amazonia, Africa, and Siberia. The results, as indicated in Table II, reveal that the Swin-Unet model consistently outperforms the other two architectures in terms of OA. The model achieved OAs of 93.3 $\%$ , 93.6 $\%$ , and 97.4 $\%$ for Amazonia, Africa, and Siberia, respectively, demonstrating its robustness in handling SAR image data across diverse ecosystems.

The robustness of the Swin-Unet model is further illustrated in Table V, which provides a comprehensive analysis of classification accuracy across various random states. The random state, a parameter used to control the shuffling and splitting of the dataset, ensures reproducibility by generating consistent train-test splits. Using a fixed train-test split ratio of $70\%-30\%$ , the model's performance was assessed by varying this parameter. The results demonstrate the consistency of the Swin-Unet model and underscore the model's ability to handle variations in training data distribution, which is crucial for real-world applications.

In addition to its robustness, the Swin-Unet model demonstrates significant computational efficiency, as shown in Table VI. The performance evaluation of the three DL models presented in this table provides an in-depth analysis of frames per second (FPS) and model complexity at varying inference batch sizes (BS), following the methodology outlined in [54]. This comparison highlights the computational advantages of the Swin-Unet, which achieves the highest FPS across all batch sizes and demonstrates the best speed-up factors, particularly at higher batch sizes, compared to the Attention U-Net and 3-D-FCN models. Simulations and FPS evaluations were conducted on a NVIDIA GeForce RTX 3070 GPU high-performance workstation, with 8192 MiB of dedicated memory and 128 GB of system RAM.

As shown in Table VI, the FPS increases with the batch size due to GPU parallelization, saturating between BS 16 and 32. This trend underscores the importance of batch size in optimizing inference speed, particularly when handling high-dimensional spatio-temporal data such as the 28 seasonal features used in this work. Furthermore, the Swin-Unet combines high classification accuracy with the lowest model complexity ( $25.16\text{M}$ parameters) among the three models tested, while the 3-D-FCN model, in contrast, experienced out-of-memory (OOM) issues at higher batch sizes. This combination of robustness, computational efficiency, and lower model complexity reinforces the suitability of Swin-Unet for large-scale LC mapping tasks.

A critical aspect of the study, beyond the robustness and computational efficiency of the Swin-Unet model, was its ability to effectively classify LC classes, particularly the Forest, Grassland, Cropland, and Water categories, which were extracted with high accuracy across all study areas. Conversely, the classification of the Bareland class proved more challenging for CNN-based methods, which struggled to distinguish it from similar classes such as Water and Grassland. The overlap in pixel value distributions, highlighted by the histogram analysis, underscores the inherent difficulties in classifying these LC types. In Africa, the 3-D-FCN model exhibited lower performance in recognizing the Bareland class, achieving a PA of only 0.36, while the Swin-Unet model significantly improved this to 0.93. The challenges faced by the other models could stem from their reliance on locality-dependent attention mechanisms, which may not adequately capture the complex spatial relationships inherent in certain LC types. Furthermore, the analysis of the RF algorithm against the DL models indicates a general trend of lower performance by RF, particularly in heterogeneous landscapes such as those found in Amazonia and Siberia. Despite achieving comparable results to some DL models in specific scenarios, its OA and Kappa coefficient were inferior when compared to the Swin-Unet, which showcased superior capability in extracting intricate patterns from the data.

The investigation into ecoregion-specific performance indicates that the Swin-Unet model performs better when the training and testing datasets share similar ecoregional characteristics, yielding higher OAs. This is likely because consistent ecoregional characteristics reduce the variability in LC features and backscatter signatures within the datasets, enabling the model to learn more representative patterns and generalize effectively within the same ecoregion. Smaller ecoregions with limited data pose significant challenges, particularly in Amazonia and Africa, where ecological diversity introduces variability in vegetation types, LC patterns, and climatic conditions. This diversity often results in overlapping backscatter signals in SAR imagery, making it difficult for the model to distinguish between similar LC classes. In addition, the presence of mixed LC types within small tiles further complicates training and limits the model's ability to generalize effectively across the region. In Siberia, the model's ability to generalize is more apparent due to the region's relative climatic and ecological uniformity. In conclusion, the analysis of ecoregion-specific performance highlights the importance of considering regional characteristics in remote sensing applications. While the Swin-Unet model demonstrates strong generalization capabilities, especially in Siberia, its performance is constrained by limited training data in smaller ecoregions in Amazonia and Africa. The study reveals the detrimental effect of limited training data on model generalization, especially in ecologically diverse regions. Increasing the number of training tiles in underrepresented ecoregions would likely enhance performance. Future work should focus on enhancing training datasets, employing advanced data augmentation techniques, and developing ecoregion-aware models to improve LC classification in diverse ecological contexts. These improvements could significantly enhance the applicability of remote sensing in addressing global environmental challenges.

The incorporation of ecoregions into this analysis is particularly significant. Ecoregions represent geographically distinct areas characterized by specific climatic, ecological, and biological attributes, making them vital for understanding and mitigating the impacts of climate change [55]. For climatologists working on ESA's Climate Change Initiative (CCI+), ecoregions provide a structured framework to evaluate how environmental factors influence LC dynamics. Aligning remote sensing products with ecoregion classifications enables the development of models tailored to the unique environmental conditions and ecological processes of each area. This alignment enhances their accuracy and increases their applicability in region-specific studies and decision-making processes. This approach not only enhances the precision of LC mapping but also facilitates more targeted analyses of climate-related phenomena, such as deforestation, desertification, or wetland degradation. The refined SAR-based products generated by the proposed method hold immense potential as robust tools for monitoring and addressing climate change challenges across diverse ecological zones.

The challenges associated with SAR data are also a significant consideration in this study. SAR data is inherently prone to speckle noise, a granular interference that arises due to the coherent nature of radar signals. This noise can obscure fine details in the imagery, reducing clarity and making accurate LC classification more challenging. This issue is particularly pronounced in heterogeneous regions, where the backscatter signatures of different LC types, such as Bareland and Water, often overlap, complicating classification tasks. In addition, variations in acquisition geometry and the high sensitivity of SAR to surface moisture and roughness adds another layer of complexity, as these factors can alter the backscatter signal and create ambiguities in classification. Addressing these challenges requires advanced preprocessing techniques, such as speckle filtering and radiometric normalization, as well as model architectures capable of extracting robust features despite these inherent limitations. Future research should also explore the integration of complementary datasets, such as optical imagery, to mitigate these challenges and enhance the overall performance of SAR-based LC classification. For instance, optical imagery can provide detailed spectral information that complements the structural data captured by SAR, improving differentiation between similar LC types such as Bareland and Water LC classes. In addition, combining SAR's all-weather capabilities with optical data's sensitivity to vegetation and soil properties could offer a more comprehensive understanding of dynamic environmental processes.

In addition, the comparative analysis with SAR time series further highlights the advantages of the synthesized seasonal feature extraction method employed in this study. Consolidating seasonal variations into a single “super image” simplified the data processing pipeline, achieving greater classification accuracy. In other words, instead of analyzing images from different seasons individually, they are combined into a single representation that retains relevant temporal information. This allows for more precise results and reduces the computational load required for the classification process. Overall, the results of this study demonstrate the efficacy of the Swin-Unet model for LC mapping using SAR imagery, particularly in complex ecological regions. The findings suggest a promising avenue for future research, which could focus on integrating additional features and employing advanced modeling techniques to further improve classification performance across diverse environmental contexts.

SECTION VI.

Conclusion

This study showcases the potential of DL architectures in effectively mapping LC types using SAR data. The novelty of this study lies in the combination of the transformer-based Swin-Unet model with seasonal synthetic features. This unique approach leverages the advanced capabilities of transformers along with tailored synthesized spatio-temporal images to achieve excellent results in terms of OA. The results indicate that the proposed approach outperforms traditional CNN-based architectures, especially in regions with diverse environmental and climatic characteristics. Furthermore, leveraging multitemporal SAR data for climate change analysis can provide crucial insights into environmental shifts in sensitive regions, utilizing the synthetic information provided by seasonal features to bridge temporal gaps, particularly in critical regions such as the Siberian zone. Future research will focus on enhancing model generalization across diverse ecoregions through adaptive or transfer learning techniques. Moreover, refining data preprocessing methods to address quality issues inherent in SAR data is essential, as it could significantly advance the field, leading to more precise and reliable LC mapping outcomes. These future directions hold significant promise for advancing the effectiveness and applicability of DL-based approaches in LC mapping using SAR data at large scale.

References is not available for this document.

A Deep Learning Architecture for Land Cover Mapping Using Spatio-Temporal Sentinel-1 Features

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

Introduction

A. Key Contributions

Proposed Approach

A. SAR S1 Data PreProcessing

B. Feature Extraction

1) Multitemporal Despeckle Filtering

2) Spatio-Temporal Feature Computation

C. Network Preprocessing

D. Swin-Unet

E. Swin-Unet Model Configuration

Input Data and Test Areas

A. S1 Mission and Data

B. Test Areas

C. Training Set Generation

Results

A. Results Across Different Study Areas

B. Analysis by Ecoregions

C. Comparative Analysis With SAR Time Series

Discussions

Conclusion

References

IEEE Account

Purchase Details

Profile Information

Need Help?

A Deep Learning Architecture for Land Cover Mapping Using Spatio-Temporal Sentinel-1 Features

Alerts

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

Introduction

A. Key Contributions

Proposed Approach

A. SAR S1 Data PreProcessing

B. Feature Extraction

1) Multitemporal Despeckle Filtering

2) Spatio-Temporal Feature Computation

C. Network Preprocessing

D. Swin-Unet

E. Swin-Unet Model Configuration

Input Data and Test Areas

A. S1 Mission and Data

B. Test Areas

C. Training Set Generation

Results

A. Results Across Different Study Areas

B. Analysis by Ecoregions

C. Comparative Analysis With SAR Time Series

Discussions

Conclusion

Authors

Figures

References

Keywords

Metrics

Footnotes

References

IEEE Account

Purchase Details

Profile Information

Need Help?