ScoreSeg: Leveraging Score-Based Generative Model for Self-Supervised Semantic Segmentation of Remote Sensing

The performance of semantic segmentation of remote sensing images (RSIs) heavily depends on the number of pixel-level annotations. In practice, accumulating pixel-level annotations for large RSIs is quite expensive or even impossible under certain scenarios. Here, we try to solve this data-intensive problem from the novel perspective of score-based self-supervised learning (SSL) and introduce a robust RSI semantic segmentation model called ScoreSeg. Unlike traditional pixel-level SSL paradigms, the generative SSL mechanism in ScoreSeg is simple in loss design and stable in pretraining, granting it a strong ability for dense feature learning from very large RSIs. In the model implementation, ScoreSeg first extracts pixelwise representations of RSIs by pretraining a time-dependent score-based model on abundant off-the-shelf unlabeled RSIs. Then, to address the sparse feature problem in RSIs, the features collected from different timesteps and resolutions are aggregated into a rich feature map for downstream semantic segmentation. Experimental results on three datasets show that our proposed ScoreSeg outperforms state-of-the-art (SOTA) SSL methods and alternative models pretrained on ImageNet by nontrivial margins, especially with very limited annotations.

I. INTRODUCTION

Remote sensing images (RSIs) are used in a broad range of real-world applications such as urban planning, meteorological analysis, and natural disaster prediction [1], [2], [3]. Semantic segmentation of RSIs aims to perform pixelwise classification of input images, playing an irreplaceable role in the field of RSI interpretation.
Compared with natural imagery, the complicated spectral responses of multiple objects caused by different sensors, angles, and weather conditions make semantic segmentation of RSIs particularly challenging. The performance of semantic segmentation of RSIs has been greatly improved due to recent advances in deep learning [4], [5], [6]. However, these deep learning-based RSI semantic segmentation methods are heavily data-driven and may require a large number of pixelwise labels of RSIs. Apparently, these labor-intensive annotations are not only expensive but also very inefficient for some emerging tasks. Moreover, as the geographical environment changes, models without sufficient generalization need additional labels to function properly. Therefore, it is necessary to explore new semantic segmentation approaches with lower label consumption for RSIs.
In recent years, studies on self-supervised learning (SSL) have paved the way to alleviate reliance on massive labeled data [7], [8], [9], [10], [11], [12]. SSL methods learn representations by pretraining on unlabeled data and then transfer the learned results to downstream tasks by fine-tuning on limited labeled data. The core of SSL methods lies in designing pretext tasks and the corresponding loss functions [7]. Among them [7], [8], [10], [11], [12], contrastive SSL methods have become a prevalent solution for SSL in the remote sensing field. Notwithstanding this progress, the pretext tasks in contrastive SSL methods are mainly developed for image-level tasks and are not optimized for pixel-level dense prediction as required in semantic segmentation. For example, some recent remote sensing (RS) works carefully designed and combined multiple objectives based on contrastive learning [13], [14], [15]. In spite of achieving promising results, it may be complicated to choose suitable objectives and adjust their weights manually for specific tasks and datasets. The advancement of SSL methods for semantic segmentation of RSIs thus remains constrained.
Here, we propose to tackle SSL on RSIs from the novel perspective of score-based generative learning. Score-based [17], [18] or diffusion [19], [20] models are a new family of generative models originally designed for image synthesis.
As the name implies, the core of score-based generative models is to estimate score functions (defined as gradients of log probability density functions) and utilize them for sampling (i.e., generation). Different from other generative models such as variational autoencoders (VAEs) [21] and generative adversarial networks (GANs) [22], the score-based models have several unique characteristics which are highly desired in our SSL for semantic segmentation of RSIs.
1) Great training stability: Score-based generative models can be trained efficiently without any adversarial strategy and scale well with more data and bigger models [18], [23], implying that they can take full advantage of the huge amount of off-the-shelf unlabeled RSIs.
2) Coherent visual semantic concepts: A wide range of studies have investigated the use of generative models as representation learners. Among them, score-based models have been shown to learn coherent and semantic concepts in their internal representations of natural images [24], [25].
3) Representations of multiresolution: The current state-of-the-art (SOTA) score-based models are mostly implemented with UNet architectures [17], [18], [23], [26], and many tasks in computer vision, including semantic segmentation, can be boosted by their built-in multiscale features [27], [28], [29].
4) Representations conditioned on timesteps: Score-based models are always conditioned on timesteps during both training and inference, which can be used to train a more robust segmentation model even with very limited labeled samples.

Inspired by the above insights, in this work, we introduce score-based models to expand the SSL paradigms of semantic segmentation in remote sensing. It is worth noting that there has been research [24], [30] on exploring the representation learning ability of score-based models. However, most of these studies are based on natural images, which differ significantly from RSIs. In this work, we focus on leveraging the rich representations from score-based models for semantic segmentation of RSIs.
Through visualization and experimental analysis, we observe that the multitimesteps and multiscale representations in our pretrained score-based model can alleviate the sparse characteristic of RSIs caused by multiple objects in one image and complex backgrounds. As shown in Fig. 1(b) and (c), as the resolution of the feature maps decreases, the representations vary from fine-grained to coarse, focusing on more semantic information than typical backbones such as ResNet [31]. The black car in the shadow can be clearly distinguished at different timesteps, as shown in Fig. 1(d) and (e). Thus, we combine information from different timesteps of the score-based model and design a multiscale fusion module based on the deformable transformer [32] to aggregate multiresolution features from the implemented UNet. Benefiting from the powerful feature extraction capability of the pretrained score-based model and the sparse representation aggregation of RSIs by the above modules, we achieve promising results using only a small amount of annotated data on several datasets, e.g., 55.60% mIoU (+5.93% over other SSL methods and +8.43% over the ImageNet-supervised method) on the Potsdam dataset with 1% annotations. We further study the influence of pretraining data on ScoreSeg, which indicates that ScoreSeg can adapt well to pretraining on mixed domains of RSIs. Empirically, we also show that pretraining ScoreSeg on mixed domains has few side effects on downstream tasks, which is of great promise in practice because RSIs in the wild are very diverse in space and time.
In general, our main contributions are summarized as follows.

II. RELATED WORK

A. Score-Based Generative Models
As a promising new direction for generative modeling, score-based generative models [17], [18] are trained to learn score functions on a large number of noise-perturbed data distributions. This modeling approach allows for a simple training process and a flexible network architecture. Score-based generative models have recently succeeded in image generation [18], [26], outperforming other generative models such as VAEs [21] and GANs [22]. Score-based generative modeling also shows promising results on several downstream applications, including representation learning [33], image inpainting [34], and semantic segmentation [24]. However, in the field of remote sensing, there are few research works on score-based generative models. To the best of our knowledge, there is no relevant work in the field of semantic segmentation of RSIs yet. The work closest to ours concerns remote sensing change detection [25], which also proposed to pretrain a score-based model for feature extraction. However, we argue that the sparse features of RSIs and the corresponding feature fusion methods were not considered there, while being proven essential for semantic tasks such as segmentation in this work.

B. Self-Supervised Learning in RSIs
As of today, the contrastive learning paradigm dominates SSL methods in the RS field [35]. Nevertheless, we first review related works on generative methods and compare our proposed approach with them. The method most closely associated with our ScoreSeg is denoising autoencoders (DAEs) [36], which have a long history of research. DAEs have been less studied in recent years due to their relatively high data requirements and low representation quality. However, score-based models are essentially different from DAEs in both objectives and model structures. Score-based models are always conditioned on timesteps (usually injected into normalization layers in networks [17], [20]), which breaks the problematic noise prediction down so that it can be learned step by step, reducing the dependence on large amounts of data and encouraging rich representations. At present, another promising generative SSL paradigm is masked image modeling (MIM), such as MAE [9], which reconstructs masked image patches. In addition, semisupervised learning techniques are also widely studied, including pseudolabeling [37] and consistency regularization [38].
Compared with generative SSL methods, contrastive ones benefit from their flexibility in designing pretext tasks, for which consistent pretext tasks can be matched with various downstream tasks and combined with specific prior knowledge to achieve better performance. Chen et al. [39] observed the "imbalanced fineness" and "imbalanced appearances" problems of fine-grained ship classification and proposed the "Push-and-Pull Network (P2Net)" based on instance discrimination [40], embedding image features between subclasses into a more decoupled space to classify them more easily. Mikriukov et al. [41] trained image and text encoders by generating augmentations for both modalities and optimizing a contrastive objective to obtain deep representations for text-image retrieval.
Because many corresponding pretext tasks have been proven easy and efficient, such as instance discrimination, contrastive SSL methods achieve promising results on many image-level tasks of RSIs as listed above, while the progress on dense prediction tasks such as semantic segmentation is limited. Nevertheless, there have been some recent works attempting to tackle this problem by introducing auxiliary objectives. For example, Li et al. [15] combined contrastive learning with inpainting and augmentation transform prediction tasks to learn high-level and low-level image features at the same time. Also based on default image-level learning branches, GLCNet [14] and IndexNet [13] force the networks to learn pixel-level representations via additional branches of local feature matching and object position tracking, respectively.

C. Semantic Segmentation in RSIs
For semantic segmentation, fully convolutional networks (FCNs) [42] were the first demonstration of the potential of deep learning. To date, PSPNet [28], the DeepLab series [27], UNet [43], HRNet [29], and other influential models have laid the foundation of semantic segmentation. For research on semantic segmentation in RSIs, the above models are introduced and combined with designs specific to RSI characteristics, which also obtain great performance on most dataset evaluations. For example, based on HRNet, Xu et al. [44] proposed a lightweight feature-selection module to aggregate features of multiscale objects in very-high-resolution (VHR) RSIs. ResUNet-a [6] is an enhanced version of the original UNet architecture that incorporates various modifications, including residual connections and atrous convolutions.
With the emergence of attention mechanisms [45] in computer vision [46], many transformer-based methods for semantic segmentation of RSIs have been developed. For example, UNetFormer [5] constructed a UNet-like encoder-decoder structure and used a lightweight encoder and a transformer-based decoder to model global and local features for real-time urban scene segmentation. In practice, a common method is to replace the original CNNs (e.g., ResNet50) with transformers, which have a larger number of parameters and higher computational complexity, as the backbone to extract features while keeping the segmentation framework unchanged to get better results [4].
The multimodal nature of RSIs has prompted the advent of numerous semantic segmentation techniques leveraging multiple RS modalities, including optical, hyperspectral [47], [48], and SAR data. He et al. [49] proposed the intuition-inspired hypergraph network (I2HN), using a neuroscience-inspired approach to model intramodal and intermodal relationships. Hong et al. [50] developed a multimodal deep learning (MDL) framework focusing on the strategic fusion of diverse modalities, which enhanced performance in complex pixelwise scene classification. In the work of Ma et al. [51], a cross-modal attention mechanism coupled with transformers was devised to capture long-range dependencies across multiscale feature maps, underlining the potential of multimodal deep learning in RSI segmentation.

III. PRELIMINARIES ON SCORE-BASED GENERATIVE MODELING
The framework of ScoreSeg is deeply rooted in score-based generative modeling. Therefore, we first briefly introduce the preliminaries of score functions and score matching approaches before giving the details of our RSI segmentation model. Score-based generative models aim at training a neural network to estimate score functions by methods called score matching [17]. One of the most efficient approaches to estimating score functions is denoising score matching, which perturbs data with a prespecified noise (e.g., Gaussian) to estimate the perturbed score functions scalably. The perturbation process can also be viewed as a discretization of stochastic differential equations (SDEs) determined by score functions [18]. In this unified framework, score-based generative models and diffusion probabilistic models [19] can both be viewed as discrete versions of continuous-time perturbations by SDEs, so we use the term score-based generative models to refer to this model family (i.e., learning a denoising process under SDEs) in this work.
Here, we provide a brief overview of the framework of score-based modeling based on classical noise-conditioned score networks (NCSNs) [17]. Given the probability density of the data distribution p(x), the score function is defined as ∇_x log p(x), and its estimation by the neural network is denoted as s_θ(x). Our target is to minimize the Fisher divergence between the model estimation and the data distribution, defined as

\frac{1}{2}\,\mathbb{E}_{p(\mathbf{x})}\!\left[\left\|\nabla_{\mathbf{x}}\log p(\mathbf{x})-s_\theta(\mathbf{x})\right\|_2^2\right]. \tag{1}

However, in our optimization objective (1), the ground-truth score function ∇_x log p(x) is unknown. To circumvent this problem and also alleviate inaccurate estimation in low data density regions [17], we can perturb data points with noise and train score-based models on the noisy data points instead. To control the scale of noise, we introduce a variable t called the timestep. Specifically, using the denoising score matching technique, we can perturb the clean data x_0 with Gaussian noise q_{0t}(x_t | x_0) = N(x_t; x_0, σ_t^2 I) and obtain a noise-perturbed distribution p_{σ_t}(x) = ∫ p(y) N(x; y, σ_t^2 I) dy (abbreviated as p(x_t)), where the variance σ_t^2 of the Gaussian distribution is time-dependent and I is an identity matrix. The noise-perturbed data x_t can be obtained by x_t = x_0 + σ_t ε with ε ∼ N(0, I). When t is large enough, the data converge to the designed prior distribution (usually an isotropic Gaussian, i.e., x_T ∼ N(0, σ_T^2 I)).

To train the time-dependent score-based model, we take the noise-perturbed data x_t and the timestep t as the input and the condition, respectively. Then we use the network estimation s_θ(x_t, t) to fit the perturbed score function ∇_{x_t} log p(x_t) with an L2 regression loss [like (1)], which was proven equivalent to optimizing the following expectation:

\mathbb{E}_{t}\,\mathbb{E}_{\mathbf{x}_0}\,\mathbb{E}_{\mathbf{x}_t\mid\mathbf{x}_0}\!\left[\lambda(t)\left\|s_\theta(\mathbf{x}_t,t)-\nabla_{\mathbf{x}_t}\log q_{0t}(\mathbf{x}_t\mid\mathbf{x}_0)\right\|_2^2\right] \tag{2}

where λ(t) is a coefficient function depending on the timestep t, s_θ(x_t, t) indicates that the model is conditioned on timesteps, and the target can be computed analytically as ∇_{x_t} log q_{0t}(x_t | x_0) = −(x_t − x_0)/σ_t^2 = −ε/σ_t. After the NCSN s_θ(x_t, t) is trained, with the estimation of score functions, we can perform sampling (i.e., generation) via annealed Langevin dynamics [17] or improved solvers (for faster sampling and better quality) corresponding to SDEs and ODEs [18]. Recent research [18], [19], [23] has demonstrated the inherent stability of training score-based models, attributed to their nonadversarial nature, explicit density estimation, and the integration of noise injection within the denoising score matching technique. Nevertheless, in the context of generation, score-based models may encounter estimation errors during the sampling process (e.g., Langevin dynamics) [52].

With a well-trained score-based model, we can extract meaningful representations from the network. Specifically, for an input image X ∈ R^{3×H×W} (i.e., RGB 3-channel), given any timestep t, we add Gaussian noise q_{0t}(x_t | x_0) = N(x_t; x_0, σ_t^2 I) to obtain the model's prediction of the perturbed score function at timestep t. We parameterize the score-based generative model with a UNet architecture as most existing works do [17], [18], [20]. Then a natural question is what the score estimation network has learned in its intermediate activations.
Intuitively, these features should retain most of the information in the data, because we can sample very realistic and natural images based on the output of the network. Moreover, these learned representations have been proven to be semantically coherent and can be well transferred to downstream tasks [24], [25].

IV. METHODOLOGY

A. Overview
The main goal of ScoreSeg is to learn effective representations from diverse annotation-free RSIs during pretraining and to tackle the sparse characteristics caused by multiple objects and complex backgrounds in RSIs for downstream segmentation. As illustrated in Fig. 2, we leverage score-based models for feature extraction by pretraining on plentiful unlabeled RSIs. Then, feature fusion modules are designed to aggregate these multitimesteps and multiscale representations before they are passed to the segmentation head.

B. Pixel-Level SSL on RSIs via Score-Based Modeling
With the support of a large number of high-quality and easily accessible unlabeled RSIs, we can train a robust score-based generative model to carry out general feature extraction for RSIs. Consequently, it is not necessary to conduct specific pretraining on the downstream semantic segmentation datasets, as most previous SSL methods on remote sensing do [13], [14], [15]. Specifically, we can collect a large amount of readily available unlabeled RSIs from the Google Earth Engine and crop them into the required size as the input of our score-based generative model. As mentioned in Section III, the denoising score matching technique allows for efficient training with a simple L2 loss function based on (2). In practice, we randomly sample a timestep t ∈ [0, 2000] uniformly and choose a cosine-like noise schedule σ_t in the perturbing process [20]. Then, we define the coefficient function λ(t) = σ_t^2 to compute the loss function [17].
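To make the pretraining objective concrete, the following is a minimal sketch of one denoising score matching step under the above setting. The cosine-like schedule written here and the name `score_net` for the time-conditioned UNet are illustrative placeholders, not the exact scheduler or interface of our implementation (which follows SR3 [55] and [25]).

```python
import math
import torch

T = 2000  # number of timesteps, as used in our pretraining

def sigma(t, sigma_min=0.01, sigma_max=50.0):
    # Hypothetical cosine-like noise schedule; the exact scheduler follows SR3 / [25].
    w = 0.5 * (1.0 - torch.cos(math.pi * t / T))
    return sigma_min * (sigma_max / sigma_min) ** w

def dsm_loss(score_net, x0):
    """One denoising score matching step with lambda(t) = sigma_t^2.

    score_net(x_t, t) is the time-conditioned UNet estimating the score.
    """
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device).float()   # t sampled uniformly
    s = sigma(t).view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = x0 + s * eps                                         # perturb the clean image
    pred = score_net(x_t, t)
    # lambda(t) * ||pred - (-eps / sigma_t)||^2 simplifies to ||sigma_t * pred + eps||^2
    return ((s * pred + eps) ** 2).flatten(1).sum(dim=1).mean()
```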
Compared with other generative models like GANs, the training of the score-based generative model is more efficient and stable with only a regression objective rather than adversarial training [22] or variational inference [21]. After pretraining the score model, we can obtain rich representations from its intermediate activations at multiple timesteps and multiple image resolutions.
Compared with existing pixel-level SSL methods like contrastive modeling [13], [14], [15], there are two prominent advantages of adopting score-based generative modeling to perform SSL here. First, from the perspective of pretext tasks, the pretraining step in ScoreSeg is more stable. ScoreSeg only requires learning a simple score estimation function, while other contrastive SSL methods need to design additional tasks to learn pixel-level features, which may conflict with their original image-level branch based on instance discrimination. Second, from the perspective of loss functions, the pretraining loss in ScoreSeg is quite simple. The only SSL objective utilized in ScoreSeg is an L2 loss, while other contrastive SSL models always combine multiple objectives as the final target, which is erratic and difficult to validate in the pretraining stage.
Unlike traditional generative models, score-based generative models have a unique time-dependent characteristic. We can hence take advantage of this time-dependent merit to get rich feature representations from multiple sampling steps. Formally, given an RSI x_0, we first choose several fixed timesteps t = {t_1, t_2, ..., t_N} to calculate the corresponding noise scales σ_t (a linear schedule is used during inference) and sample the noise-perturbed images x_t by x_t = x_0 + σ_t ε with ε ∼ N(0, I), t = {t_1, t_2, ..., t_N}. It can be noted that when the timestep t is too large (i.e., close to T), due to the large variance σ_t, the noisy images lose most of the useful information and mislead the consequent predictions. However, empirical evidence shows that direct learning from the clean data x_0 can
Then, the collections of noise-perturbed images x_t and their corresponding sampling timesteps t are sequentially passed into the pretrained score estimation network s_θ(x_t, t) to obtain the corresponding feature representations at different timesteps. Five layers at different resolutions of the UNet decoder are collected as the feature maps corresponding to a single timestep input. We note that these collected raw features are both fully time-resolved (because of multiple sampling steps) and spatially resolved (because the UNet decoder has multiple layers). As we have emphasized in the introduction and visualized in Fig. 1, both the multiscale and multitimesteps representations can be leveraged to train a better segmentation model. We will elaborate on how we fuse these abundant feature maps both temporally and spatially in the next section.
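The following sketch illustrates this collection step, assuming a frozen score network that can expose its five decoder activations; the keyword `return_decoder_features` is a hypothetical hook (e.g., implementable with forward hooks), and `sigma_fn` stands for the inference-time noise schedule.

```python
import torch

@torch.no_grad()
def extract_features(score_net, x0, sigma_fn, timesteps=(50, 150, 300)):
    """Collect decoder feature maps of a frozen score network at several timesteps.

    Assumes `score_net(x_t, t, return_decoder_features=True)` returns a list of
    five decoder activations at resolutions 256, 128, 64, 32, 16 (a placeholder
    for however the pretrained UNet exposes its intermediate activations).
    """
    per_scale = [[] for _ in range(5)]           # one list per resolution
    for t in timesteps:
        t_batch = torch.full((x0.shape[0],), float(t), device=x0.device)
        s = sigma_fn(t_batch).view(-1, 1, 1, 1)
        x_t = x0 + s * torch.randn_like(x0)      # noise-perturbed input for this timestep
        feats = score_net(x_t, t_batch, return_decoder_features=True)
        for i, f in enumerate(feats):
            per_scale[i].append(f)               # grouped by resolution, ordered by timestep
    # per_scale[i] holds len(timesteps) maps of shape (B, C_i, H_i, W_i)
    return per_scale
```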

C. Multitimesteps and Multiscale Feature Fusion
The design of the multitimesteps fusion method can be flexible, and here we try both nonparametric and parametric ways. We set the concat way, i.e., only doing concatenation along the channel dimension, as the baseline, while the mean way averages all corresponding positions of the multitimesteps feature maps. Meanwhile, two methods with parameters are constructed, both of which calculate weighted averages at corresponding positions. The weighted mean method takes the weights as learnable parameters, while the SE mechanism [53] recalibrates the feature maps adaptively. The results are analyzed in Section V-D2.
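A minimal sketch of the four fusion variants is given below; the channel sizes, the softmax normalization of the learnable weights, and the SE reduction ratio are illustrative assumptions rather than the exact ScoreSeg configuration.

```python
import torch
import torch.nn as nn

class TimestepFusion(nn.Module):
    """Fuse T feature maps (one per timestep), each of shape (B, C, H, W), into one map."""
    def __init__(self, mode, channels, num_steps, reduction=16):
        super().__init__()
        self.mode = mode
        if mode == "weighted_mean":
            # Learnable per-timestep weights (softmax normalization is one possible choice).
            self.w = nn.Parameter(torch.ones(num_steps) / num_steps)
        elif mode == "se":
            # Squeeze-and-excitation recalibration over the concatenated channels [53].
            c = channels * num_steps
            self.se = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
                nn.Conv2d(c // reduction, c, 1), nn.Sigmoid(),
            )

    def forward(self, feats):                    # feats: list of (B, C, H, W)
        x = torch.stack(feats, dim=1)            # (B, T, C, H, W)
        if self.mode == "concat":
            return x.flatten(1, 2)               # (B, T*C, H, W)
        if self.mode == "mean":
            return x.mean(dim=1)
        if self.mode == "weighted_mean":
            w = torch.softmax(self.w, dim=0).view(1, -1, 1, 1, 1)
            return (w * x).sum(dim=1)
        if self.mode == "se":
            cat = x.flatten(1, 2)
            return cat * self.se(cat)            # adaptively recalibrated concatenation
        raise ValueError(self.mode)
```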
Multiscale representations have been proven to be pivotal to a majority of SOTA segmentation models based on both CNNs and transformers [5], [27], [28], [29]. Following current works [24], [25], we consider utilizing features in the decoder of the pretrained UNet, which are innately of multiple resolutions. Baranchuk et al. [24] empirically show that the extracted features correspond to coarse semantic masks in shallow (i.e., higher resolution) UNet blocks and become more fine-grained but exhibit less semantic meaning in deep blocks. Nevertheless, to the best of our knowledge, these works did not probe into integrating these multiscale features and only tried a pixelwise MLP [24] or a UNet-decoder-like structure [25]. Here we point out that these methods might be suboptimal for semantic segmentation of RSIs because of the multiobject characteristic and the intricate interactions among objects, which require long-distance sparse modeling to capture discriminative features.
Drawing inspiration from prominent self-attention mechanisms [32], [45], we propose using a deformable transformer to address this issue. The method allows each feature vector to interact sparsely with the multiresolution feature maps. To formalize, let us denote the input image as X ∈ R^{3×H×W} and its feature maps extracted at the last block in each resolution of the UNet decoder as Z_i ∈ R^{C_i×H_i×W_i} (i = 1, 2, ..., N). In this case, Z_i represents the fused multitimesteps feature maps with the same shape. For practical implementation with an input size of 256 × 256, we extract feature maps at five different resolutions ranging from 256 × 256 to 16 × 16 (N = 5). We deviate from typical transformer encoders like ViT [46] to maximize the capture of fine-grained pixelwise representations. In other words, we do not crop the feature maps into patches or perform downsampling as they do.
We generate the single-scale feature vector V_i ∈ R^{d×(H_i×W_i)} as follows:

V_i = \mathrm{Flatten}(\mathrm{Conv}(Z_i)) + \mathrm{pos}_i. \tag{3}

In this equation, Conv is a 1×1 convolution with output channels d = 256. Flattening is performed on the spatial dimensions, i.e., height and width. The position embedding pos_i in (3) represents the feature map index pos_ind ∈ R^{d×(1×1)} and the pixelwise location pos_spa ∈ R^{d×(H_i×W_i)} on the specific feature map. This is further expressed as

\mathrm{pos}_i = \mathrm{Fuse}(\mathrm{pos}_{\mathrm{ind}}, \mathrm{pos}_{\mathrm{spa}}). \tag{4}

Here, the Fuse function is a broadcast addition, where pos_ind are learnable parameters and pos_spa is the standard sinusoidal position encoding. By concatenating all single-scale feature vectors V_i (i = 1, 2, ..., N), we can form the global dense feature vectors V = Concat(V_1, ..., V_N) ∈ R^{d×Σ_i(H_i×W_i)}. To conserve memory and computation cost for self-attention mechanisms, we predict K sampling offsets for every query z ∈ R^d from the global dense feature vectors V based on its position p ∈ R^2 on each single-scale feature vector. This method replaces the need for direct interactions with all keys from V.
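The sketch below assembles the flattened multiscale tokens described by (3) and (4) under these definitions; the concrete channel list, the learnable level embedding, and the 2-D sinusoidal encoding are written in one plausible form and may differ in detail from our implementation.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pos2d(d, h, w, device=None):
    """Standard 2-D sinusoidal position encoding with d channels, shape (d, h*w)."""
    assert d % 4 == 0
    d4 = d // 4
    freq = torch.exp(torch.arange(d4, device=device) * (-math.log(10000.0) / d4))
    ys = torch.arange(h, device=device).float()[:, None] * freq          # (h, d4)
    xs = torch.arange(w, device=device).float()[:, None] * freq          # (w, d4)
    pe_y = torch.cat([ys.sin(), ys.cos()], dim=1)[:, None, :].expand(h, w, 2 * d4)
    pe_x = torch.cat([xs.sin(), xs.cos()], dim=1)[None, :, :].expand(h, w, 2 * d4)
    return torch.cat([pe_y, pe_x], dim=2).permute(2, 0, 1).reshape(d, h * w)

class ScaleTokens(nn.Module):
    """Turn fused feature maps Z_i into flattened tokens V_i = Flatten(Conv(Z_i)) + pos_i."""
    def __init__(self, in_channels, d=256, num_levels=5):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, d, kernel_size=1) for c in in_channels)
        self.level_embed = nn.Parameter(torch.zeros(num_levels, d))      # pos_ind, learnable
        self.d = d

    def forward(self, feats):                     # feats: list of (B, C_i, H_i, W_i)
        tokens = []
        for i, z in enumerate(feats):
            b, _, h, w = z.shape
            v = self.proj[i](z).flatten(2)        # (B, d, H_i*W_i)
            pos = sinusoidal_pos2d(self.d, h, w, z.device) + self.level_embed[i][:, None]
            tokens.append(v + pos)                # broadcast addition ("Fuse")
        return torch.cat(tokens, dim=2)           # global dense feature vectors V
```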
Assume the deformable attention feature for the mth head (m = 1, 2, ..., M) out of M total self-attention heads is denoted as f_m(z). This feature is calculated as follows (for simplification, we omit the index m on the right-hand side of the equations):

f_m(z) = \sum_{i=1}^{N}\sum_{k=1}^{K} W_{ik}\, V_i^{\mathrm{proj}}\!\left(\phi_i(p) + \Delta p_{ik}\right)

W = \mathrm{Softmax}(\mathrm{Linear}(z)), \quad \Delta p = \mathrm{Linear}(z), \quad p = \mathrm{Normalize}(\mathrm{Linear}(z)).
In these formulas, k denotes the index of sampling offsets.
V_i^{proj}(·) represents the learnable linear value projection of V_i. W ∈ R^{N×K} and Δp ∈ R^{N×K×2} signify the attention weights and sampling offsets, respectively. The Softmax function ensures the weights sum to 1. The φ_i and Normalize functions are employed for scaling during sampling on different scales. To make the process differentiable, we evaluate V_i^{proj}(·) through bilinear interpolation at the sampled positions. In practice, for multihead attention, we concatenate the outputs f_m(z) of all M heads for all queries z and integrate a skip connection and a fully connected feedforward network, similar to standard transformers [45], to obtain enhanced global dense feature vectors. It is important to note that these enhanced feature vectors possess the same shape as the original input V, which means the deformable attention can be executed iteratively.
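For illustration, the following is a simplified single-layer sketch of the above multiscale deformable attention (after Zhu et al. [32]); it omits the residual connection and feedforward network, and the projection shapes and coordinate conventions are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    """Each query predicts a reference point p, K offsets, and weights per level and per head,
    then gathers values by bilinear sampling instead of attending to every key."""
    def __init__(self, d=256, num_levels=5, num_heads=8, num_points=4):
        super().__init__()
        self.h, self.l, self.k, self.dh = num_heads, num_levels, num_points, d // num_heads
        self.value_proj = nn.Linear(d, d)
        self.ref = nn.Linear(d, 2)                                   # p = Normalize(Linear(z))
        self.offsets = nn.Linear(d, num_heads * num_levels * num_points * 2)
        self.weights = nn.Linear(d, num_heads * num_levels * num_points)
        self.out = nn.Linear(d, d)

    def forward(self, queries, value_maps):
        # queries: (B, Q, d); value_maps: list of L feature maps, each (B, d, H_i, W_i)
        B, Q, d = queries.shape
        p = self.ref(queries).sigmoid()                              # reference point in [0, 1]^2
        off = self.offsets(queries).view(B, Q, self.h, self.l, self.k, 2)
        w = self.weights(queries).view(B, Q, self.h, self.l * self.k)
        w = w.softmax(-1).view(B, Q, self.h, self.l, self.k)         # weights sum to 1
        out = queries.new_zeros(B, Q, self.h, self.dh)
        for i, vm in enumerate(value_maps):
            _, _, H, W = vm.shape
            v = self.value_proj(vm.flatten(2).transpose(1, 2))       # (B, H*W, d)
            v = v.transpose(1, 2).reshape(B * self.h, self.dh, H, W) # split heads
            loc = p[:, :, None, None, :] + off[:, :, :, i] / vm.new_tensor([W, H])
            grid = (2.0 * loc - 1.0).view(B, Q, self.h * self.k, 2)  # grid_sample expects [-1, 1]
            grid = grid.permute(0, 2, 1, 3).reshape(B * self.h, self.k, Q, 2).transpose(1, 2)
            sampled = F.grid_sample(v, grid, align_corners=False)    # (B*h, dh, Q, K), bilinear
            sampled = sampled.view(B, self.h, self.dh, Q, self.k).permute(0, 3, 1, 4, 2)
            out = out + (w[:, :, :, i, :, None] * sampled).sum(dim=3)
        return self.out(out.reshape(B, Q, d))                        # concat heads + projection
```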
After several deformable self-attention operations, the network can capture long-distance discriminative features from the sparse representations in RSIs, which are then sent to the segmentation head as input. Compared with prior feature fusion methods, CNN-based ones have limited receptive fields and lack long-distance modeling ability, while transformer-based ones incur high computational complexity due to interactions between all pairs of feature vectors. We show some visualization results in Fig. 3, where the deformable transformer incorporated in ScoreSeg is relatively better. In addition, in multiscale fusion, the widely used pooling and atrous convolution [28] lead to inevitable information loss, partly affecting the smoothness and integrity of the final segmentation results. To further prove this, we conduct an ablation experiment using convolution layers to downsample before our multiscale fusion and upsample after it. We observe that the additional operations on resolutions do have a negative effect on the performance of semantic segmentation of RSIs (see Section V-D1).

D. Segmentation Head
After the feature fusion modules on both multitimesteps and multiresolution features, we take the fused single-scale feature vector V_1 from the enhanced global dense feature vectors and reshape it back to the resolution of the input image, i.e., Z_1 ∈ R^{d×H×W}. We use two convolution layers with a ReLU activation function to output the segmentation maps, Re = Conv_2(ReLU(Conv_1(Z_1))), where the kernel size of both Conv_1 and Conv_2 is 3. To train the whole segmentation model, we apply a combination of a cross-entropy loss L_CE and a dice loss L_DICE as our loss function L, i.e., L = αL_CE + βL_DICE. We set the hyperparameters α = β = 1 to balance the weights by default.
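A minimal sketch of this head and the combined loss is given below; the class count and the soft Dice formulation are illustrative, while the two 3 × 3 convolutions and α = β = 1 follow the setting above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Two 3x3 convolutions with ReLU, mapping the fused d-channel map to class logits."""
    def __init__(self, d=256, num_classes=6):
        super().__init__()
        self.conv1 = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(d, num_classes, kernel_size=3, padding=1)

    def forward(self, z):                        # z: (B, d, H, W), same resolution as the input
        return self.conv2(F.relu(self.conv1(z)))

def dice_loss(logits, target, eps=1.0):
    """Soft multiclass Dice loss; `target` holds class indices of shape (B, H, W)."""
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def seg_loss(logits, target, alpha=1.0, beta=1.0):
    # L = alpha * L_CE + beta * L_DICE with alpha = beta = 1 by default
    return alpha * F.cross_entropy(logits, target) + beta * dice_loss(logits, target)
```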

V. EXPERIMENTS
In this section, we first present the datasets used and the implementation details of our proposed ScoreSeg. Then, we compare its performance in semantic segmentation of RSIs with several state-of-the-art SSL methods using very limited annotated data. Finally, ablation studies are conducted on the efficiency of the designed multitimesteps and multiscale feature fusion modules and on the influence of mixed-domain pretraining.

A. Data Description
Experiments are conducted on the following three datasets.
1) ISPRS Potsdam Dataset: It contains 38 tiles of image patches with four spectral bands (red, blue, green, and near-infrared) and corresponding DSMs. The tiles have the same size of 6000 × 6000 pixels with a spatial resolution of 5 cm. Excluding the boundary category for testing, there are six categories in this dataset: impervious surface, building, low vegetation, tree, car, and clutter. In our experiments, we only utilize three channels (red, blue, green) and crop the tiles into a fixed size of 256 × 256 pixels without overlap for pretraining the other SSL methods (as mentioned in Section IV-B, we do not pretrain our score-based model further on any segmentation dataset). For fine-tuning, following the benchmark setting, we select 24 tiles of images for training and the remaining 14 tiles for testing. As a result, there are 13 248 images for training, and we randomly select 1500 images from the above cropped 14 tiles for testing.
2) ISPRS Vaihingen Dataset: It is composed of 33 tiles of IRRG (near-infrared, red, and green) image patches and has the same category information as the Potsdam dataset. The spatial resolution is 9 cm, and the average size is 2494 × 2064 pixels. We crop them to 256 × 256 pixels without overlap, yielding 2838 images. In our experiments, all of these cropped images are used for pretraining, and 16 tiles (1324 images) are used as the training set, while the remaining 17 tiles are used as the testing set.
3) DeepGlobe Land Cover Classification Dataset: The DeepGlobe Land Cover Classification Challenge is a public dataset that provides high-resolution submeter satellite imagery with six foreground classes (urban land, agriculture land, rangeland, forest land, water, and barren land) and one background class (unknown). There are 803 available annotated tiles of RGB image patches with a fixed size of 2448 × 2448 pixels. In our experiments, we randomly select 100 tiles and crop them into 10 000 nonoverlapping images of 256 × 256 pixels for training. Among the remaining tiles, 100 tiles are randomly chosen and cropped into 10 000 nonoverlapping images, and we use 1500 of them as the testing set.
We provide more details about the above three datasets in Fig. 4. It can be seen that these three datasets all have obvious class imbalance problems. Nevertheless, to test the effectiveness of ScoreSeg, we intentionally do not carry out any special design for difficult cases during fine-tuning on segmentation tasks, such as oversampling the minority class or using a weighted loss function.

B. Implementation Details

1) Settings of Pretraining: As previously discussed, we base our score-based generative model on the UNet architecture.
We utilize the readily available pretrained model provided by Gedara et al. [25] for convenience. The model's implementation adheres closely to the approach described in SR3 [55], and it is trained using approximately 1 million RGB RSIs from Sentinel-2 [16]. These images consist of 256 × 256 pixel patches, and the dataset was collected by SeCo [56]. Although the training objective is quite simple, as we introduced in Section III, the standard training details of score-based models, such as preconditioning and noise schedulers, are actually intricate. For a detailed breakdown of the pretraining process, including the UNet architecture and hyperparameters used, please refer either to SR3 or to our open-source code repository. To verify the generalization of our proposed method, we do not further pretrain the score-based model on the downstream segmentation datasets.
For performance comparison, using RSIs in the target segmentation datasets (i.e., the same as the fine-tuning datasets), we pretrain a ResNet50 with several typical SOTA SSL methods in computer vision, including MoCo v2 [8], SimCLR [11], SimSiam [12], and DenseCL [10], while a ViT-Base [46] is pretrained with MAE [9], an influential representative of masked image modeling. Besides, to prove its adaptivity to semantic segmentation of RSIs, we also reproduce SOTA pixel-level SSL methods from the remote sensing community, such as MultiTask [15], GLCNet [14], and IndexNet [13], i.e., pretraining a ResNet50 implemented by their official projects. All these segmentation models are trained with two NVIDIA 3090Ti GPUs with distributed data-parallel techniques in PyTorch. For the SOTA SSL methods from computer vision, we implement them based on the MMSelfSup codebase [57] and train them for 400 epochs using the default settings, except for adjusting learning rates for different batch sizes according to the linear scaling rule. For the mentioned SSL methods for semantic segmentation of RSIs, we keep the networks and pretraining settings the same as in their papers.
2) Settings of Fine-Tuning: When fine-tuning the pretrained models for downstream semantic segmentation of RSIs, we use a limited amount of annotated samples for training. To be more specific, 1%, 5%, and 10% of the original training datasets are utilized separately, e.g., 132 images (1%) are used as a training set for the Potsdam dataset (the default setup if not mentioned specifically in later experiments). With the frozen score-based model as the backbone, we train ScoreSeg for 200 epochs. Due to the expensive cost of inference of the SOTA score-based model (around 11 GB memory for one image), we use a batch size of 1 in our experiments. Empirically, a smaller batch size has little influence on dense prediction tasks such as segmentation, and we also note that an alternative is to cache the feature maps of each image to disk in advance. We use the AdamW optimizer, and the initial learning rate is set as 0.0001 with a linear decay schedule for all trainable modules.

For performance comparison, we use the ImageNet-pretrained ResNet50 as the baseline. With the ResNet50 and ViT-Base pretrained by the SSL methods mentioned in Section V-B1, DeepLabV3+ and UperNet [58] are used for segmentation, or, more precisely, DeepLabV3+ with the ResNet50 backbone and UperNet with the ViT-Base backbone. We implement them based on the MMSegmentation codebase [59] and follow its default hyperparameter settings. During fine-tuning, for DeepLabV3+, we use the SGD optimizer for 200 epochs with a total batch size of 16 (eight images per GPU), while we use the AdamW optimizer for UperNet. The details of the training settings, including data augmentations and learning schedules, can be found in the MMSegmentation codebase. We emphasize that these backbones are trained with their segmentation modules end-to-end, which empirically yields better results than freezing their weights.
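The sketch below reflects the ScoreSeg fine-tuning setup described above (frozen score network, AdamW with an initial learning rate of 1e-4, linear decay, 200 epochs); the module names are placeholders carried over from the earlier sketches.

```python
import torch

def build_finetune_optimizer(score_net, fusion_modules, seg_head, epochs=200, lr=1e-4):
    """Freeze the pretrained score network and optimize only the fusion modules and the head."""
    score_net.eval()
    for p in score_net.parameters():
        p.requires_grad_(False)                                  # backbone weights stay frozen
    params = [p for m in (*fusion_modules, seg_head) for p in m.parameters()]
    optimizer = torch.optim.AdamW(params, lr=lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda e: max(0.0, 1.0 - e / epochs))         # linear decay over 200 epochs
    return optimizer, scheduler
```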

C. Experimental Results
In this section, we comprehensively compare the performance of our ScoreSeg with other SSL methods and the ImageNet-supervised baseline on multiple datasets with limited annotated data, including several overall metrics, per-class IoU, parameter number, etc. We also try to analyze the reasons for the different performances of these methods, such as insufficient pretraining data and domain differences between pretraining and fine-tuning.
1) Potsdam Dataset: The performance evaluation of different pretraining approaches on the Potsdam dataset is shown in Table I. We list some highlights of our observations below.
1) In all label-size fine-tuning settings, ScoreSeg achieves state-of-the-art results among all SSL methods, even surpassing the ImageNet pretraining method, which employs millions of well-annotated images.
2) Given very limited samples, i.e., fine-tuning on only 1% labeled data, ScoreSeg obtains significantly superior performance to other methods, implying that the features extracted from the score-based model can be well transferred to downstream semantic segmentation of RSIs.
3) While the ImageNet pretraining method performs promisingly when fine-tuned on more data, it does not show an advantage over other SSL methods in the setting of 1% labeled data. One of the contributing factors is that the representations learned by pretraining on natural images need to be further adapted to fill the gap with RSIs. To confirm this, we test the segmentation mIoU with different fine-tuning epochs, as shown in Fig. 5. ScoreSeg surpasses the others significantly at the beginning of fine-tuning, i.e., at the 20th epoch, while ImageNet pretraining leads to poor results. In addition, owing to the time-dependent characteristic of score-based modeling and our multitimesteps strategy, the training process is also more stable during fine-tuning on downstream tasks.

We also calculate the IoU of each class as shown in Fig. 6. Compared with other SSL methods, ScoreSeg leads in most classes, especially the clutter category, which is sparse and indistinguishable in the Potsdam dataset. Meanwhile, we evaluate the scale of these networks (only the trainable parts are considered), including the number of parameters and the floating point operation count (FLOPs). As shown in Table II, in terms of trainable parameters, our feature fusion modules and segmentation head contain far fewer parameters than typical segmentation models such as DeepLabv3+, which helps to take full advantage of very limited annotated samples. However, since ScoreSeg retains the original high resolution of the deep feature maps and performs the transformer-based feature fusion on them, there is still considerable computational complexity (i.e., FLOPs), although the deformable attention has significantly reduced the computing cost because attention interactions are only carried out at sampled positions instead of all positions. Referring to the implementation of HRNet [29], we use 3 × 3 convolution layers before the feature fusion to downsample. It is no surprise that as the downsampling factor increases, the computational complexity of the model decreases while the parameters increase and the segmentation performance deteriorates. Despite this, the downsampled ScoreSeg still provides superior overall performance compared with full-resolution segmentation models that are also pretrained using SSL methods, such as UperNet with MAE.
2) Vaihingen Dataset: Compared with the Potsdam dataset, although the Vaihingen dataset shares the same categories, it has much less data (around a tenth of Potsdam's) and suffers from a more serious class imbalance problem, in which we are interested. It can be clearly seen in Table III that ScoreSeg is the only method that correctly classifies the clutter class, which is very semantically ambiguous and difficult to learn; this can be further confirmed in the visualization results in Fig. 7(b). We also find that in our experimental settings, even though other SSL methods such as MoCo v2 and MAE are pretrained directly on the training and testing sets of the downstream datasets, they do not work well here, which we attribute to the insufficient size of the Vaihingen dataset preventing these pretrained backbones from learning good representations. Furthermore, the effect of the pretraining datasets on ScoreSeg will be discussed in Section V-D3.
3) DeepGlobe Dataset: We also evaluate the performance of these pretraining methods on the DeepGlobe dataset. The semantic segmentation results are presented in Table IV. Similar to the performance on the above two datasets, ScoreSeg has significant advantages in most classes as well as in overall accuracy. It is worth noting that the ImageNet pretraining method, which is a strong baseline on the Potsdam and Vaihingen datasets, performs dramatically worse here, being exceeded by several SSL methods including SimCLR and MAE. As shown in Fig. 7(c), the DeepGlobe dataset has lower spatial resolution and contains fewer image details, and its foreground is also more different from natural images (such as large areas of agricultural land versus cars in the Potsdam dataset), leading to a possible performance gap. In addition, those SSL methods that learn pixel-level features in contrastive styles [13], [14], [15] perform poorly on this dataset due to the weak spatial information in these agricultural lands.

D. Ablation Study
In this section, we conduct several ablation experiments to investigate the efficiency of the multiscale fusion we propose based on deformable attention mechanisms, the choice of multitimesteps fusion methods, and the effect of pretraining score-based generative models on mixed domains.
1) Efficiency of Multiscale Fusion: Based on the multiscale representations obtained from the UNet pretrained by denoising score matching, we examine how feature fusion methods affect the downstream segmentation task. Specifically, we set the single-scale method as the baseline, which only leverages the feature maps of a single resolution (the same size as the original RSI) and performs no multiscale fusion. For comparison, we test the linear multilayer perceptron (MLP) used in [24], a UNet-decoder-like structure, and the object-contextual aggregation method [54] to fuse feature maps of multiple resolutions. As shown in Tables V and VI, the multiscale fusion module is critical to achieving good performance. Among all fusion methods, the deformable transformer incorporated in ScoreSeg has the advantage under the same multitimesteps fusion method. However, it is worth noting that even with the vanilla linear method, the whole segmentation model can still obtain rich representations from the frozen feature extractor and achieve good results. This may be because the score-based model does learn transferable representations during the pretraining phase, even though it is not pretrained on the downstream datasets and its weights are frozen during fine-tuning.
2) Choice of Multitimesteps Fusion: We test the four multitimesteps fusion methods mentioned in Section IV-C. Moreover, although we intentionally do not search for the optimal combination of timesteps for each dataset, we attempt to extend it from t = {50, 150, 300} to see if better results can be achieved with more input.
Table V shows the impact of different multitimesteps fusion methods. It can be observed that the simple concat fusion is effective for both the UNet decoder and the deformable transformer methods. Compared with the single-timestep results in Table VI, appropriate aggregation of multiple timesteps can improve performance in most situations. In addition, it can be noted that when the set of t is expanded, the performance of most methods tends to deteriorate. On the one hand, more input brings difficulties to the learning of the networks (more parameters are required). On the other hand, it may be because when t is too large, the network extracts coarse-grained features, which is unfavorable to segmentation (also reflected in Table VI).
3) Score-Based Model Pretraining on Mixed Domains: For self-supervised learning on RSIs, the influence of domain differences is a problem that cannot be ignored because RSIs are very diverse in data distribution. For most SSL methods, this causes worse results on downstream tasks if pretraining and fine-tuning use dissimilar datasets [13], [14]. However, in practice, since we can easily obtain RSIs varying greatly in time and space, it is much more noteworthy to investigate the effect of pretraining on mixed domains. Therefore, in addition to the model pretrained with data collected from Sentinel-2 patches, we pretrain three models from scratch: one on the Potsdam dataset, one on DeepGlobe, and one on a combination of both datasets. We emphasize that the categories of the Potsdam dataset are very different from those of the DeepGlobe dataset, and the latter has a significantly lower spatial resolution. These models are all pretrained for 200 000 iterations with the same architecture and hyperparameters as [25], and each of them takes around 2 days on four NVIDIA 3090Ti GPUs. Then, all score-based models with frozen weights are used to fine-tune the segmentation modules on the Potsdam dataset and the DeepGlobe dataset with 1% annotated data. The results are listed in Table VII. It can be seen that pretraining on mixed domains brings few side effects to downstream tasks, and on the Potsdam dataset it achieves results close to those pretrained on large-scale Sentinel-2 data.
To further confirm the effect of pretraining, we conduct some sampling with the score-based generative models to generate RSIs. As shown in Fig. 8, the model pretrained on mixed domains also performs better; e.g., Fig. 8(a) seems to assign features from cars to building patches, while Fig. 8(b) shows relatively intact roofs.
In brief, we empirically demonstrate the good performance of pretraining score-based models on mixed domains of RSIs, which is a problem for most contrastive learning methods, especially those with negative sampling techniques [14]. This is related to the apparent differences between these domains, which introduce a shortcut solution to distinguish positive and negative samples, weakening the representation learning process in the local domain.

4) Generalization of Segmentation Heads:
In the original experimental setting of our proposed ScoreSeg model, a straightforward network with two convolution layers is used as the segmentation head. This selection is primarily made to elucidate the effective integration of our fusion modules without introducing additional complexities from the segmentation head. Inspired by other successful segmentation networks, we further investigate the impact of implementing more sophisticated segmentation heads. Specifically, we explore the DeepLabV3 [60] and DeepLabV3+ [27] heads, implemented following the guidelines of the open-source project DeepLabV3Plus-PyTorch [61].
In this experiment, the features extracted after fusion are treated as "high-level representations," while the original features derived from the UNet serve as "low-level representations." We keep all other experimental settings consistent and conduct the training on the Potsdam dataset. Table VIII reports the results of our study. Interestingly, it reveals that the simple architecture of dual convolution layers performs comparably, if not slightly better, than the more complex DeepLab-series segmentation heads. We interpret this surprising result in two ways. First, the potent deformable transformer integrates the features efficiently, simplifying the alignment with discrete labels. Second, the scarcity of annotations actually turns out to be an advantage for architectures with fewer parameters, as it facilitates the optimization process. Therefore, despite its simplicity, our basic dual-convolution-layer network proves to be an efficient choice for the segmentation head in our ScoreSeg model.

VI. DISCUSSION
In this work, we introduce score-based generative models for self-supervised learning on RSIs and propose an efficient feature fusion method utilizing attention mechanisms for semantic segmentation, showing promising results on several benchmark datasets with very limited annotated samples. In this section, we further discuss the potential and some limitations of utilizing score-based generative models on RSIs.
As shown in Table V, with a frozen score-based model to extract representations, a vanilla linear classification model can also achieve satisfying results. This can be further confirmed by the simple K-Means clustering visualization in Fig. 9: it appears that the model has learned semantic and localized features that can group objects in RSIs coherently. Thus, it is possible to guide the model toward semantic segmentation in a weakly supervised way, which has recently become an appealing task in computer vision [62], [63]. For example, semantic guidance through text signals, i.e., descriptions of images, can be achieved with existing image-retrieval datasets of RSIs. The realization of language-driven semantic segmentation of RSIs would be of high value because it can not only get rid of the expensive labeling of RSI masks but also conduct reasoning in open scenes without being restricted by the limited categories of the datasets used during training. As we demonstrated in Section V-D3, score-based models can adapt well to mixed domains in the real world during self-supervised pretraining, and existing research in the field of computer vision has proven that score-based models, or so-called diffusion models, can scale up well with more parameters and data [18], [23]. Therefore, we believe that score-based models have the potential to be applied to semantic segmentation of RSIs under weak supervision.
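As an illustration of this probing, the following sketch clusters the pixelwise features of a frozen backbone with K-Means, as done for Fig. 9; the cluster count and the scikit-learn interface are assumptions of this sketch rather than part of ScoreSeg itself.

```python
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def kmeans_visualization(feature_map, num_clusters=6):
    """Cluster the pixelwise features of one image into segments (as in Fig. 9).

    `feature_map` is a (C, H, W) activation from the frozen backbone, e.g., the
    4x-downsampled decoder map at timestep t = 150; the cluster count is arbitrary.
    """
    c, h, w = feature_map.shape
    pixels = feature_map.reshape(c, -1).T.cpu().numpy()      # (H*W, C) feature vectors
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(pixels)
    return labels.reshape(h, w)                               # pseudo segmentation map
```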
However, we also point out some limitations of the current application of score-based models to RSIs. Since RSIs are very diverse in the wild and usually have high resolution, pretraining score-based models on RSIs requires a large number of parameters (around 393 M for our 256 × 256 model) and a large amount of computation, which also leads to a longer training time and slower inference. Therefore, it is not suitable for downstream tasks that require real-time performance, such as real-time object detection. Besides, for most SSL methods, we usually continue to fine-tune the backbone with a smaller learning rate rather than fix its weights directly, but the expensive cost of heavy models also makes it difficult to train the score-based models with specific auxiliary modules (e.g., the feature fusion modules and segmentation head in this work) end-to-end on downstream tasks. Nevertheless, there are some related improvements in the field of computer vision, such as designing more efficient network structures [23] or working in a compressed latent space instead of the image space itself [26]. In addition, further refinements might be possible if the characteristics of RSIs, such as rich diversity and high resolution, are considered and addressed specifically.

VII. CONCLUSION
We extend SSL methods for RS via score-based generative modeling, which can be well transferred to downstream tasks. Furthermore, considering the time-dependence of score-based models and the sparse representations of RSIs from the implemented UNet architecture, we design both multitimesteps and multiscale fusion modules that enable better performance on semantic segmentation of RSIs. Experiments on several datasets prove that ScoreSeg not only exceeds segmentation models pretrained using existing SOTA SSL methods but also outperforms the model with supervised pretraining on ImageNet. Empirically, we show that pretraining score-based models on mixed domains does not impair the performance of a local domain on the downstream semantic segmentation task. In practice, this is very useful because it makes it possible to collect a large number of off-the-shelf RSIs without having to filter them to ensure that they share similar characteristics such as resolution, time, etc.
Our work, ScoreSeg, marks a significant stride toward incorporating score-based models into the semantic segmentation of RSIs. This effort underscores an array of promising avenues for future investigation. Primary among these is the potential use of advanced fine-tuning techniques such as LoRA [64]. This approach could feasibly facilitate end-to-end training, thus potentially yielding superior performance compared to scenarios where the weights of score-based models are frozen. Moreover, as discussed in Section VI, ScoreSeg can potentially be extended to weakly supervised signals, including text. Given the existence of large-scale image retrieval datasets for RSIs, the application of ScoreSeg in weakly supervised open-vocabulary semantic segmentation of RSIs, under mere image-level text guidance, is theoretically viable. Lastly, apart from semantic segmentation, the potent representation capabilities of score-based models could enhance a range of remote sensing tasks. Typical tasks that require pixelwise dense predictions, such as land cover classification, change detection [25], and super-resolution, are key beneficiaries. Additionally, score-based models could potentially pave the way for more advanced and flexible query modeling, including tasks like object detection [65]. Continued exploration in these areas promises a fruitful future for score-based models in remote sensing.

Fig. 1 .
Fig. 1. Representations extracted from pretrained backbones. The resolution decreases from left to right. (a) Image and ground truth. (b) Representations from ResNet152. Although the very deep ResNet152 pretrained on ImageNet in a supervised manner is generally thought to learn good semantic information, those image-level features cannot well represent sparse ground objects of RSIs. For example, due to the influence of light, the black car in the shadow is very easily confused with the background, whose features are difficult for neural networks to capture. (c)-(e) Representations from score-based models. The timestep increases from top to bottom. The score-based model is pretrained on Sentinel-2 [16] patches without any supervision but can extract rich semantic representations that prove to remain effective without any further fine-tuning on downstream segmentation tasks. The representations from different timesteps can complement each other, which helps train a more robust segmentation model.

Fig. 2 .
Fig. 2. Overview of the proposed method. (a) Unsupervised pretraining of a score-based model via denoising score matching. (b) Disturb the input image x_0 with Gaussian noise q_{0t}(x_t | x_0) at multiple timesteps t = {t_1, t_2, ..., t_N} and send them to the frozen UNet network. (For simplicity, only two timesteps are illustrated here.) (c) Gather the feature maps from the UNet decoder and perform feature fusion over multiple timesteps and resolutions. (d) Take the fused feature map (with the same resolution as the input image) and send it to the segmentation head to obtain segmentation prediction maps.

Fig. 4 .
Fig. 4. Pixel proportion of each category in the dataset annotations; the class "Others" includes the annotated edge and the padding from the image segmentation processing. (a) Potsdam dataset. (b) Vaihingen dataset. (c) DeepGlobe dataset.

Fig. 5 .
Fig. 5. Segmentation results with different fine-tuning epochs on the Potsdam dataset using 1% labeled data.

Fig. 8 .
Fig. 8. Synthesized images from score-based models pretrained on different datasets. (a) Pretraining on the Potsdam dataset. (b) Pretraining on the mixed datasets and imitating Potsdam style. (c) Pretraining on the DeepGlobe dataset. (d) Pretraining on the mixed datasets and imitating DeepGlobe style. (We call it imitating because we cannot control the image style, i.e., unconditional generation.)

Fig. 9 .
Fig. 9. Visualization of K-Means clustering. From left to right are the images, annotations, and results from the pretrained ResNet152 and score-based models, respectively. We select the feature maps with 4× downsampling and timestep t = 150 for our ScoreSeg.
Through pretraining on a large number of easily accessible unlabeled RSIs, a score-based generative model is used to extract rich representations of RSIs for downstream semantic segmentation. To the best of our knowledge, this is the first time that a score-based model has been used for semantic segmentation of RSIs.

TABLE I RESULTS OF SEMANTIC SEGMENTATION ON THE POTSDAM DATASET USING 1%, 5%, AND 10% LABELS

TABLE III RESULTS OF SEMANTIC SEGMENTATION ON THE VAIHINGEN DATASET USING 10% LABELS; IOU FOR EACH CLASS IS REPORTED


TABLE IV RESULTS OF SEMANTIC SEGMENTATION ON THE DEEPGLOBE DATASET USING 5% LABELS; IOU FOR EACH CLASS IS REPORTED

TABLE V ABLATION STUDIES OF DIFFERENT FUSION MODULES AND TIMESTEPS ON THE POTSDAM DATASET

TABLE VI ABLATION STUDIES OF DIFFERENT FUSION MODULES WITH SINGLE-TIMESTEP ON THE POTSDAM DATASET

TABLE VII RESULTS OF SEMANTIC SEGMENTATION ON THE POTSDAM DATASET USING MIXED DOMAINS PRETRAINING

TABLE VIII ABLATION STUDIES OF DIFFERENT SEGMENTATION HEADS ON THE POTSDAM DATASET