Fine-Grained Abandoned Cropland Mapping in Southern China Using Pixel Attention Contrastive Learning

Cropland abandonment has multifaceted and controversial impacts on the natural environment and socioeconomic development. Utilizing remote sensing data offers the potential for comprehensive coverage and large-scale insights into automated abandoned cropland identification. However, accurately capturing small abandoned cropland, particularly in regions, such as southern China, with fragmentized farmland, poses a significant challenge using the traditional optical image-based mapping methods due to their coarse spatial resolution. In addition, irregular and chaotic textures of abandoned cropland further complicate the accurate prediction using very high resolution (VHR) data. In this article, we propose a novel deep learning network termed pixel attention contrastive network (PACnet) to map fine-grained abandoned cropland based on VHR data. Cross-image pixel contrast learning is introduced to discern distinctive features distinguishing abandoned cropland from other land types across various interimages. Moreover, a criss-cross attention module is embedded to enhance the contrasting characteristics within individual intraimages. Experimental outcomes validate the efficacy of PACnet, showcasing the highest accuracy (OA = 93.8% and mIOU = 71.7%) when compared with classical semantic segmentation networks. Our proposal not only underscores the potency of VHR remote sensing data in finely delineating abandoned cropland but also carries significant implications for cropland abandonment impact analysis and informed policy formulation.


Haoyang Li, Haomei Lin, Junshen Luo, Teng Wang, Hao Chen, Qiuting Xu, and Xinchang Zhang

I. INTRODUCTION
Abandoned cropland represents a form of land use characterized by marginalization arising from inadequate suitability and economic viability [1], [2], [3]. The abandonment of arable land has multifaceted and profound effects on factors such as soil erosion, biodiversity, carbon storage, and the development of the agricultural economy [4], [5], [6], [7]. In China, the matter of abandoned cropland has garnered significant attention, primarily driven by concerns over food security, particularly in the economically developed regions of southern China [3], [8], [9]. Precise mapping of abandoned cropland is essential for analyzing the driving factors contributing to its occurrence and understanding its impact on the natural environment and socioeconomic aspects. In contrast to less efficient field-based research, remote sensing (RS) technology provides a more convenient and expeditious means of mapping large-scale land use types, and it can also be applied effectively for monitoring abandoned cropland [10].
Existing mapping methods employing coarse images (≥ 10 m) frequently suffer from mixed pixels at the edges of small parcels, which makes the accurate depiction of small abandoned cropland difficult, notably in southern China. Owing to physical geographical conditions and historical factors, cropland parcels in southern China exhibit fragmentation and marginalization [19]. In contrast, very high resolution (VHR) images (≤ 1 m) can more effectively identify small and fragmented parcels. Therefore, the utilization of VHR images is imperative for acquiring finely detailed maps of abandoned land. Constrained by data availability, VHR images with limited temporal series density pose challenges for conducting temporal and spectral analysis. Thus, the key to mapping abandoned cropland at the submeter level lies in the profound exploration of fine-grained information and visual features within a single-phase VHR image.
Recently, methods based on deep learning (DL) have demonstrated their efficacy in VHR image interpretation [20], [21]. In contrast to traditional texture modeling methods, DL networks exhibit a superior capacity to harness spatial-context information in VHR images, offering an enhanced depiction of surface details and intricate spatial information [22], [23]. The extraction of deeper texture features and semantic information through the layers of deep neural networks effectively captures the visual target information within VHR images [24]. Numerous DL networks have been proposed for the fine-grained extraction of diverse land cover types from VHR images [25], [26]. An increasing number of DL frameworks are under development to enhance the capability to perceive complex VHR semantic information for specific tasks, including change detection [27], building extraction [28], and tree crown mapping [29]. DL methods can yield improved results by analyzing specific objectives and adjusting DL modules. Critical directions for DL network design and improvement include feature enhancement [30], [31] and feature fusion [22], [32]. In addition, incorporating graph structures constitutes a novel approach for mining topological information [33] and distilling contextual information [34]. The powerful learning capability of DL in the context of fine-grained high-resolution landscapes offers a viable avenue for mapping VHR abandoned cropland.
Abandoned cropland exhibits distinct visual discriminative features in VHR images, yet its fuzzy and amorphous characteristics present a significant challenge for DL networks. Unlike cultivated farmland, abandoned cropland in VHR images exhibits distinct fuzzy texture features. It lacks signs of artificial cultivation and typically resembles grassland and shrubland. In contrast to the orderly patterns found in neighboring farmland and orchards, abandoned cropland displays conspicuously chaotic and disordered textures, as illustrated in Fig. 1. This forms a clear visual foundation for identifying abandoned cropland. DL techniques can effectively train and predict by extracting the distinct visual textures of the target areas.
Nonetheless, the fuzzy and uncertain textures of abandoned cropland continue to present challenges for accurate identification by DL architectures. Indeed, DL algorithms excel in recognizing objects with well-defined boundaries and textures (e.g., buildings and roads) rather than natural amorphous regions characterized by fuzzy edges and intraclass variations (e.g., agricultural areas) [35], [36]. Abandoned cropland exhibits more pronounced amorphous characteristics than typical natural features, posing significant challenges for fine-grained mapping. Shen et al. [37] conducted experimental work to map abandoned cropland in VHR images, introducing a neural network with a texture calculation module. Although texture enhancement learning can bolster the context-awareness capability of DL, the capacity for perceiving textures in abandoned cropland remains inadequate. Consequently, tackling the challenge of perceiving the amorphous features of abandoned cropland remains a paramount concern.
Driven by the pronounced distinctions between abandoned cropland and well-maintained cropland, our approach centers on contrasting the heterogeneity between cultivated and abandoned areas rather than directly identifying the unclear features. To accentuate the distinctive features, we have developed a pixel attention contrastive network (PACnet) designed to capture the differentiated features both across images (interimage) and within images (intraimage). In PACnet, we employ cross-image pixel contrastive learning (CPCL) to analyze distinctive features across images. Subsequently, a criss-cross attention module (CCAM) is applied to bolster the capacity to capture disparities among neighboring regions and highlight the contrast between abandoned cropland and its nearby surroundings.
In a nutshell, our contributions can be listed as follows.
1) We construct a VHR abandoned cropland dataset (VACD) exceeding 14 000 samples for DL network training and propose a DL network called PACnet designed explicitly for extracting fine-grained details of abandoned cropland from VHR (0.5 m) RS images.
2) Confronted with abandoned cropland's ambiguous and disordered visual attributes, we introduce CPCL and CCAM to characterize contrasting features between abandoned cropland and other land categories.
3) Our proposed approach yields competitive results on the VACD dataset, affirming the viability and substantial potential of VHR abandoned cropland mapping.

A. Semantic Segmentation in RS
Semantic segmentation is a foundational task in the computer vision field, which refers to labeling all the pixels in images. The proposal of fully convolutional networks (FCNs) [38] marked a significant milestone in image segmentation. FCN, employing deconvolution in place of a fully connected layer, enables processing input images of any size and predicting every pixel within them. Recent research endeavors [39], [40], [41], [42], [43], [44], [45] in the domain of semantic segmentation fall into two primary categories. One aims to expand the receptive field and facilitate multiscale context extraction, while the other focuses on the incorporation of attention modules. U-Net [39] exemplifies the former category, characterized by its distinctive symmetric "U"-shaped structure. U-Net merges low- and high-level features via skip connections, retaining some edge characteristics.
Furthermore, the utilization of atrous convolution [40] constitutes another representative approach. By incorporating a fully connected conditional random field (CRF) and atrous spatial pyramid pooling (ASPP), the DeepLab series [41] attains enhanced representation capability and more precise multiscale object segmentation. The introduction of attention mechanisms represents another crucial technique for improving DL performance. These mechanisms excel in extracting contextual importance by calculating correlations among instances. The convolutional block attention module [42] and dual attention module [43] serve as typical examples of such attention modules. In addition, leveraging attention mechanisms, the transformer-based architecture has introduced a new perspective to the computer vision field. Milestone works in this regard include the vision transformer [44] and the segmentation transformer [45], which have significantly advanced the utilization of transformers in semantic segmentation.
The rapid advancements in DL offer a novel perspective for RS image classification, akin to the principles underlying semantic segmentation in the computer vision field. In contrast to natural images, RS images captured by satellites and aircraft are susceptible to various factors, including lighting and photography angles [46]. Hence, incorporating spatially contextual information is essential in RS image segmentation. In recent years, extensive research efforts [18], [47], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60] have focused on enhancing pixel-level accuracy in semantic segmentation of RS images. These efforts can be categorized into three primary strategies: multiscale contextual information extraction, information fusion, and postprocessing techniques [24]. Zhao and Du [47] employed a multiscale convolutional neural network to extract deep spatial information from hyperspectral images. ScasNet was developed to capture multiscale contexts on the encoder output [48]. MC-FCN [47] applied additional constraints to intermediate layers, thereby enhancing its multiscale feature representations and improving building segmentation accuracy. MGSNet [58] extracted background information surrounding the target to improve sample distinguishability. Li et al. [59] proposed multiscale split attention to acquire more detailed representations through grouping. ACAHNet [55] utilizes an asymmetric multiheaded cross-attention module to enhance the contextual features extracted by both the CNN and the transformer network.
Information fusion is another integral aspect of the DL network. Marmanis et al. [49] extracted spectral and digital elevation model information from two channels, with a convolution layer combining the results from both channels. Wang et al. [53] introduced a gated convolutional neural network for selecting adaptive features during the fusion of different layer features. AERNet [56] employed a contextual feature aggregation module to fuse information from different context features. SSPN [57] applied multiscale interfusion to enrich the extracted features and improve the sensitivity of the spectral-spatial information. Wang et al. [60] designed a global dependence fusion module to fuse features extracted from hyperspectral and SAR images.
Postprocessing techniques, such as simple linear iterative clustering superpixel segmentation [50] and CRFs [51], are commonly applied to refine the RS image segmentation results. Additional postprocessing techniques, such as the integration of point clouds and high-resolution images, mitigate the salt-and-pepper noise in classification results [54].

B. Contrastive Learning
Categorized by label availability, contrastive learning can be grouped into unsupervised and supervised forms. In the realm of unsupervised contrastive learning (UCL), pioneering studies [61], [62] laid the foundation by introducing pretext tasks and defining positive/negative samples. Wu et al. [61] introduced instance discrimination as the pretext task for UCL. Ye et al. [62] defined positives as varying augmentation outcomes from a single image, considering other images and their augmentations in the dataset as negatives. Nearly all studies [63], [64] find that the size of the negative sample collection dramatically influences the performance of UCL. However, the challenge of designing sample collections that balance computational efficiency and UCL performance persists. Subsequently, two milestone works, MoCo [63] and SimCLR [64], were proposed to solve this problem. MoCo introduced a queue structure and momentum encoder to create a comprehensive and coherent sample collection. SimCLR, on the other hand, discarded dedicated data containers such as memory banks, relying on large batches instead, and introduced projection heads to outperform prior self-supervised methods significantly. Most UCL serves as a pretraining step for the downstream task, especially for the classification task. The performance of UCL pretraining on dense tasks, such as semantic segmentation and object detection, is unsatisfactory [65]. DenseCL [66] and VICRegL [67] were developed to address this issue. However, while they demonstrate effectiveness on natural image datasets, their performance on RS image datasets warrants further improvement.
Supervised contrastive learning (SCL) primarily incorporates the concept of CL to amplify representational capacity and regularize the embedding space. Zhao et al. [68] defined pixels belonging to the same class in other images as additional positive samples. The introduction of these more challenging positives directed the network to cluster pixels of the same class. The model was initially trained using pixelwise label-based contrastive loss and, subsequently, fine-tuned with pixelwise cross-entropy loss for semantic segmentation. Chaitanya et al. [69] employed the global contrastive loss to enhance image-level representative capacity and the local contrastive loss to distinguish adjacent regions. These studies leverage both global and local context at the image level while striving to extract distinctive pixel-level features.

III. METHODOLOGY
This section introduces the dataset prepared for experiments and details the proposed PACnet. The dataset is presented in Section III-A. Then, we briefly introduce the PACnet framework in Section III-B. In Section III-C, CPCL is proposed for enhancing interimage contrasts, and in Section III-D, CCAM is embedded for exploring intraimage contrasts. Finally, we introduce the loss function of PACnet in Section III-E.

A. Dataset Preparation
The VACD is annotated on Google Earth VHR images (0.5 m) obtained in 2022 in Guangdong Province, China. We label the abandoned cropland through human visual interpretation. As shown in Fig. 1, abandoned cropland and cultivated farmland are spectrally similar, as both are vegetation, but quite different in texture. The cultivated fields have neater and more regular textures, while the textures of abandoned cropland are significantly irregular and messy. Abandoned cropland filled with ruderal vegetation is often located in monticules and depressions. Since weeds often overgrow shrubs, the surrounding shrubs hinder the extraction of abandoned farmland.
We crop the complete image scenes into 512 × 512 patches and randomly divide them into a training set of 10 608 patches, a validation set of 2653 patches, and a test set of 1474 patches. Some patches of the VACD are shown in Fig. 2.
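The cropping and random split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `tile_patches` and `random_split`, the use of non-overlapping tiles, the dropping of ragged borders, and the random seed are all assumptions.

```python
import numpy as np

def tile_patches(image, size=512):
    """Crop an (H, W, C) scene into non-overlapping size x size patches.

    Assumption: borders that do not fill a whole patch are dropped.
    """
    height, width = image.shape[:2]
    return [image[y:y + size, x:x + size]
            for y in range(0, height - size + 1, size)
            for x in range(0, width - size + 1, size)]

def random_split(n_items, n_train, n_val, seed=0):
    """Shuffle patch indices and split them into train/val/test index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_items)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Split sizes reported for the VACD: 10 608 / 2653 / 1474 patches.
train_idx, val_idx, test_idx = random_split(14735, 10608, 2653)
```

The three index arrays are disjoint by construction because they are slices of a single permutation.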

B. PACnet Architecture
According to the characteristics of abandoned cropland and the surrounding surface features, we propose PACnet, shown in Fig. 3, to enhance the contrasting features across images and within images. We introduce CPCL to focus on the overall and global feature contrast. We cast CPCL as a dictionary query task [52]. The target pixel for prediction is seen as a query, and the search range containing samples (positive and negative) is similar to a dictionary with keys. CPCL calculates the contrastive loss between the selected pixel (query) embeddings and other pixel embeddings.
The other branch, the CPCL branch, passes P through a projection head composed of two 1 × 1 convolution layers with ReLU. The projection head maps every high-level pixel feature p ∈ P into a 256-dimensional ℓ2-normalized feature vector in preparation for the calculation of the contrastive loss. The projection head is applied only during training and is removed at inference. The contrastive loss is later computed between the query and keys selected from the memory bank. The memory bank contains the pixel and region embeddings, and the region embeddings are calculated from the projected image feature P and the corresponding labels.
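As a rough illustration of the projection head, a 1 × 1 convolution acts as a per-pixel linear map, so the head can be sketched with plain matrix products. The function name `projection_head`, the weight shapes, and the placement of the single ReLU between the two layers are assumptions; only the two 1 × 1 layers, the 256-dimensional output, and the ℓ2 normalization come from the text.

```python
import numpy as np

def projection_head(P, W1, b1, W2, b2, eps=1e-12):
    """Map an (H, W, D_in) feature map to (H, W, 256) unit-norm embeddings.

    A 1x1 convolution is a per-pixel linear map, i.e., a matmul on the
    channel axis. The ReLU position is an assumption of this sketch.
    """
    hidden = np.maximum(P @ W1 + b1, 0.0)   # first 1x1 conv + ReLU
    z = hidden @ W2 + b2                    # second 1x1 conv
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    return z / np.maximum(norm, eps)        # l2-normalize every pixel embedding
```

Because the head is used only for the contrastive loss, it can be dropped at inference without changing the segmentation output.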

C. Cross-Image Pixel Contrastive Learning (CPCL)
This section provides a detailed introduction to CPCL in PACnet.
1) Pixel Contrastive Learning: Unlike classical image contrastive learning (ICL), CPCL is a form of pixel contrastive learning (PCL) and is a supervised algorithm. The brief frameworks of ICL and PCL are shown in Fig. 4. ICL conducts CL by applying different data augmentations to one image and contrasting the features output by the projection head. In contrast, PCL performs contrastive feature mining at the pixel level and mines fine-grained features. For RS images, the details in a scene are often too complex for ICL to clarify the semantics, making pixel-level contrast, as in PCL, more meaningful.
2) Pixel-to-Pixel and Pixel-to-Region Contrast: CPCL is introduced to explore the significantly different texture information between abandoned cropland and other land types. Through pixel-to-pixel and pixel-to-region contrasting, CPCL regularizes the embedding space by shortening the distance between features of the same class while lengthening the distance between features of different classes. Both pixel embeddings and region embeddings are stored in a memory bank ℬ. The details of CPCL are shown in Fig. 5.
For pixel-to-pixel contrast, given that pixel p in the training images is the query with the semantic label c, the positive samples are other pixels with the same label, while the negative samples are the pixels not belonging to c. These positive and negative samples, which serve as keys, are not restricted to the same image as the query.
Pixel-to-region contrast is proposed to supplement the image content information lost during downsampling. For a pixel p labeled c as the query, the positive samples are the semantic regions of class c in all images, and the negative ones are the semantic regions of all other classes in the dataset.
During training, we select queries by the "hard segmentation sampling" strategy [70] and keys by the "harder example sampling" strategy [70], [71], [72], [73]. For the former, half of the queries are chosen randomly, and half are sampled from the harder queries. The harder queries here are the pixels with the wrong prediction in the segmentation task (i.e., ĉ ≠ c, where ĉ is the predicted class). This strategy guides CPCL to focus on the pixels that are difficult for the network to predict and intensifies the generation of critical features. As for key selection, we use the "harder example sampling" strategy. For each query embedding p, we select the top 10% hardest negatives from memory bank ℬ as the negative collection, and the positive collection is built in the same way. The definition of "harder" here relates to the computation of the designed contrastive loss, and we further explain it in Section III-C4. Then, we randomly sample K negative/positive embeddings from the respective collection to compute the designed contrastive loss ℒ NCE, where K denotes the number of samples.
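The two sampling strategies can be sketched as below. The function names, the flat array layout, and the choice to draw the random half from pixels not already picked as hard queries are assumptions made for illustration; the half/half split, the top 10% pool, and the final random draw of K keys follow the description above.

```python
import numpy as np

def sample_queries(pred, label, n_queries, rng):
    """Pick queries: up to half from mispredicted (hard) pixels, the rest at random.

    pred and label are flat integer arrays of the same length.
    """
    hard = np.flatnonzero(pred != label)
    n_hard = min(n_queries // 2, hard.size)
    hard_pick = rng.choice(hard, size=n_hard, replace=False)
    # Assumption: the random half excludes pixels already chosen as hard queries.
    rest = np.setdiff1d(np.arange(label.size), hard_pick)
    n_rand = min(n_queries - n_hard, rest.size)
    rand_pick = rng.choice(rest, size=n_rand, replace=False)
    return np.concatenate([hard_pick, rand_pick])

def sample_hard_keys(query, keys, K, rng, positive, top_frac=0.1):
    """Keep the hardest `top_frac` of keys for a query, then draw K at random.

    Hard negatives have a large dot product with the query;
    hard positives have a small one.
    """
    sim = keys @ query
    order = np.argsort(sim) if positive else np.argsort(-sim)
    pool = order[:max(1, int(np.ceil(top_frac * len(keys))))]
    chosen = rng.choice(pool, size=min(K, pool.size), replace=False)
    return keys[chosen]
```

Restricting the pool to the hardest keys and then sampling within it keeps some randomness while still concentrating the loss on informative examples.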
3) Memory Bank: Our designed memory bank ℬ aims to balance training efficiency and representative capacity. The memory bank contains pixel and region embeddings. For pixel embeddings, a pixel queue with size T is stored for each category. The pixel embeddings are contained in ℬ with a size of |𝒞| × T × D, and a fraction of them (V/T) is dynamically updated by the most recent batch. That is, during training, only a few pixels (i.e., V, with T ≫ V) are selected from the images in the latest batch and pushed into the queue. This design guarantees the consistency of the pixel embeddings and the time efficiency of the memory bank. For region embeddings, given a segmentation dataset with N images and |𝒞| classes, the keys for pixel-to-region contrast are region embeddings with size |𝒞| × N × D, where D is the dimension of the pixel embeddings. The (c, n)th element of the region embeddings is the feature vector calculated by the average pooling of every pixel embedding of class c in the nth image. Therefore, the total size of the memory bank is |𝒞| × (T + N) × D.
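A minimal sketch of such a memory bank is given below. The class name `MemoryBank`, the first-in-first-out overwrite policy, and the per-image update interface are assumptions; the array sizes and the average-pooled region embeddings follow the description above.

```python
import numpy as np

class MemoryBank:
    """Per-class pixel queues of length T plus per-image region embeddings."""

    def __init__(self, n_classes, T, N, D):
        self.pixel = np.zeros((n_classes, T, D))    # |C| x T x D pixel queues
        self.region = np.zeros((n_classes, N, D))   # |C| x N x D region embeddings
        self.ptr = np.zeros(n_classes, dtype=int)   # next write slot per class
        self.T = T

    def update(self, emb, label, image_id, V, rng):
        """emb: (H*W, D) embeddings of one image; label: (H*W,) class ids."""
        for c in np.unique(label):
            feats = emb[label == c]
            # Region embedding: average pooling over all pixels of class c.
            self.region[c, image_id] = feats.mean(axis=0)
            # Pixel queue: enqueue only V (<< T) random pixels.
            # Assumption: the oldest slots are overwritten first (FIFO).
            take = rng.choice(len(feats), size=min(V, len(feats)), replace=False)
            slots = (self.ptr[c] + np.arange(len(take))) % self.T
            self.pixel[c, slots] = feats[take]
            self.ptr[c] = (self.ptr[c] + len(take)) % self.T
```

Updating only V entries per class and image keeps most of the queue unchanged between iterations, which is what keeps the stored embeddings consistent and the update cheap.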

4) Loss Function of CPCL:
The InfoNCE loss is widely used in UCL, and it can be represented as in (1), where v+ (v−) is the embedding of the positive (negative) sample for image P, 𝒩_P stores the embeddings of negative samples, "•" denotes the dot product, and τ > 0 is a temperature hyperparameter. All the embeddings here are ℓ2-normalized.
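For reference, the standard InfoNCE form that matches the symbols defined here can be written as below. This is a sketch: the symbol v for the embedding of image P and the set name 𝒩_P are assumptions, since only v+, v−, the dot product, and τ are named in the text.

```latex
\begin{equation}
\mathcal{L}_{P}^{\mathrm{NCE}}
  = -\log
    \frac{\exp(v \cdot v^{+} / \tau)}
         {\exp(v \cdot v^{+} / \tau)
          + \sum_{v^{-} \in \mathcal{N}_{P}} \exp(v \cdot v^{-} / \tau)}
\tag{1}
\end{equation}
```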
CPCL extends (1) to the supervised dense prediction task to realize the pixel-to-pixel and pixel-to-region contrast described above. The resulting loss is defined in (2), with the term X defined in (3). The discernibility of training samples is vital in the segmentation task. In our work, the derivative of the contrastive loss (2) with respect to the query embedding p is given in (4), with the term Y defined in (5), where m p+/− ∈ [0, 1] is the matching probability between the key e+/e− and the query p, computed as in (6). A dot product of query p and negative e− closer to 1 indicates a harder negative sample, i.e., the negative key is similar to the query p. Meanwhile, a positive e+ whose dot product with p is closer to −1 is regarded as a harder positive, i.e., the positive key is dissimilar to the query p.
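One consistent way to write (2) to (6) in the notation of this section is sketched below. The set symbols 𝒫_p and 𝒩_p (positive and negative keys of query p) and the exact split into X and Y are assumptions; the gradient in (4) and (5) follows from differentiating (2) and (3) with respect to p.

```latex
\begin{align}
\mathcal{L}_{p}^{\mathrm{NCE}}
  &= \frac{1}{|\mathcal{P}_{p}|} \sum_{e^{+} \in \mathcal{P}_{p}} -\log X,
  \tag{2} \\
X &= \frac{\exp(p \cdot e^{+} / \tau)}
          {\exp(p \cdot e^{+} / \tau)
           + \sum_{e^{-} \in \mathcal{N}_{p}} \exp(p \cdot e^{-} / \tau)},
  \tag{3} \\
\frac{\partial \mathcal{L}_{p}^{\mathrm{NCE}}}{\partial p}
  &= -\frac{1}{\tau\,|\mathcal{P}_{p}|} \sum_{e^{+} \in \mathcal{P}_{p}} Y,
  \tag{4} \\
Y &= \left(1 - m_{p^{+}}\right) e^{+}
     - \sum_{e^{-} \in \mathcal{N}_{p}} m_{p^{-}}\, e^{-},
  \tag{5} \\
m_{p^{+/-}}
  &= \frac{\exp(p \cdot e^{+/-} / \tau)}
          {\exp(p \cdot e^{+} / \tau)
           + \sum_{e' \in \mathcal{N}_{p}} \exp(p \cdot e' / \tau)}.
  \tag{6}
\end{align}
```

Under this form, a negative with a large m p− or a positive with a small m p+ contributes a large gradient term, which is why such keys count as "harder" in the sampling strategy.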

D. Criss-Cross Attention Module
The nearest surface features (e.g., cultivated farmland and shrub) in the current scene of an image contain abundant content information, so we use CCAM to intensify the extraction of contextual importance and the local feature representative capacity along two orientations within individual images. Better intraimage representative capacity in turn helps improve the performance of PACnet. Unlike nonlocal attention modules [74], which calculate weights over all pixels directly, CCAM attends only to the pixels in the same row and column and thereby dramatically reduces the computational cost.
The detailed structure of the CCAM is represented in Fig. 6. The CCAM captures contextual information from both lateral and longitudinal directions. A feature map P with a spatial size of H × W × D first passes through two branches with 1 × 1 convolution and is transformed into M ∈ R^(H×W×D') and N ∈ R^(H×W×D') (D > D'), respectively. Via the affinity computation of M and N, we generate the affinity matrix Z, from which the attention map A with a size of (H + W − 1) × (W × H) is obtained. For each pixel p in the feature M, we can obtain M_p ∈ R^(D'). By extracting the feature vectors in the same row/column as p in the feature N, we can acquire Ω_p ∈ R^((H+W−1)×D'). The affinity matrix is computed as in (7).
After that, a softmax operation is applied on Z to calculate the final attention map A. Another 1 × 1 filter applied to P generates L ∈ R^(H×W×D) for feature adaptation. For each pixel p in the feature L, we can acquire a feature vector L_p ∈ R^D and a collection of feature vectors Φ_p ∈ R^((H+W−1)×D) whose positions are in the same row/column as p. The horizontal and vertical contextual information of p is obtained by the aggregation operation in (8), where P_p is the feature vector at pixel p of the feature map with a spatial size of H × W × D, and A_i,p is the correlation value at channel i and pixel p in the attention map A.
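In the notation above, the affinity and aggregation steps can be sketched as follows, following the usual criss-cross attention formulation. The output symbol P' and the index range of i are assumptions.

```latex
\begin{align}
Z_{i,p} &= M_{p}\, \Omega_{i,p}^{\top},
  \qquad i = 1, \dots, H + W - 1,
  \tag{7} \\
P'_{p} &= \sum_{i=1}^{H+W-1} A_{i,p}\, \Phi_{i,p} + P_{p}.
  \tag{8}
\end{align}
```

Here Ω_i,p and Φ_i,p denote the ith vectors of Ω_p and Φ_p, and A is the softmax of Z over the index i.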
In (8), the contextual importance is added to feature P to enhance the representative capacity. With a broader context extraction perspective and richer context aggregation from attention map A, PACnet achieves significant progress and is more robust for the segmentation task.
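The whole module can be illustrated with a deliberately simple, loop-based sketch. It is written for readability rather than speed, and the function name `criss_cross_attention` and the weight matrices `Wm`, `Wn`, and `Wl` (standing in for the three 1 × 1 convolutions) are assumptions.

```python
import numpy as np

def criss_cross_attention(P, Wm, Wn, Wl):
    """Criss-cross attention on a feature map P of shape (H, W, D).

    Wm, Wn: (D, D') weights producing M and N; Wl: (D, D) weights producing L.
    Each pixel attends to the H + W - 1 pixels in its own row and column.
    """
    H, W, _ = P.shape
    M, N, L = P @ Wm, P @ Wn, P @ Wl
    out = np.empty_like(P, dtype=float)
    for y in range(H):
        for x in range(W):
            # Criss-cross path: the full row y plus column x without (y, x).
            ys = np.array([y] * W + [i for i in range(H) if i != y])
            xs = np.array(list(range(W)) + [x] * (H - 1))
            omega = N[ys, xs]                  # (H + W - 1, D')
            z = omega @ M[y, x]                # affinities along the path
            a = np.exp(z - z.max())
            a /= a.sum()                       # softmax -> attention weights
            phi = L[ys, xs]                    # (H + W - 1, D)
            out[y, x] = a @ phi + P[y, x]      # aggregation plus residual
    return out
```

Each pixel touches H + W − 1 positions instead of all H × W, which is the source of the cost reduction relative to nonlocal attention.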

E. Total Loss Function
Our loss function contains the classical segmentation loss and the designed contrastive loss put forward above. The former allows PACnet to learn the discriminative features essential for abandoned cropland classification, and the latter enhances the contrast between abandoned farmland and surrounding ground features (e.g., cultivated cropland and shrubs) by explicitly exploring global semantics between pixel and region samples.
The segmentation loss we use in PACnet is the cross-entropy loss. Suppose pixel p in the image P is classified into a semantic class c ∈ 𝒞. The cross-entropy loss ℒ CE is computed as in (9), where 1_c denotes the one-hot encoding of c, c ∈ 𝒞 represents the label of pixel p, s = [s_1, s_2, ..., s_|𝒞|] ∈ R^|𝒞| is the unnormalized score vector for pixel p, and s ∈ S. The softmax operation applied to s is given in (10). The contrastive loss ℒ NCE is computed as in (2) between the query embeddings and the key embeddings from the memory bank ℬ. The hard segmentation sampling strategy selects the queries, and the harder example sampling strategy samples the keys. Then, the ultimate training loss ℒ Overall is computed as in (11), where λ > 0 is the weight of the contrastive loss.
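The combination of the two losses can be sketched as below, assuming the additive form ℒ Overall = ℒ CE + λ ℒ NCE suggested by the description of λ. The function names and the default values `lam=0.1` and `tau=0.1` are hypothetical placeholders, not values taken from the experiments.

```python
import numpy as np

def cross_entropy(scores, label):
    """Mean of -log softmax(s)[c] over pixels.

    scores: (n, |C|) unnormalized score vectors; label: (n,) class ids.
    """
    s = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_prob = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(label)), label].mean())

def pixel_nce(query, positives, negatives, tau=0.1):
    """Supervised pixel contrastive loss for one query embedding.

    query: (D,), positives: (K+, D), negatives: (K-, D), all unit-norm.
    tau is a placeholder value.
    """
    pos = np.exp(positives @ query / tau)            # one term per positive key
    neg = np.exp(negatives @ query / tau).sum()      # shared negative term
    return float(np.mean(-np.log(pos / (pos + neg))))

def overall_loss(scores, label, query, positives, negatives, lam=0.1, tau=0.1):
    """Assumed additive combination: L_CE + lam * L_NCE (lam is a placeholder)."""
    return cross_entropy(scores, label) + lam * pixel_nce(query, positives, negatives, tau)
```

In practice the contrastive term would be averaged over all sampled queries in a batch; a single query is used here to keep the sketch short.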

A. Experimental Setting
Our VACD dataset contains 14 735 samples with a size of 512 × 512. The training set contains 10 608 samples, the validation set 2653, and the test set 1474. We train the model on the training set for 100 epochs with a batch size of 64. We use the stochastic gradient descent optimizer to optimize the parameters in the model. The initial learning rate is 0.01, the weight decay is 0.0001, and the momentum is 0.9. The temperature τ in (3) is set as 0.

B. Evaluation Metrics
In this study, we use overall accuracy (OA), intersection over union (IoU), recall rate, precision rate, and F1 score to evaluate the effectiveness of all models. In binary classification, true positive (TP) represents the positive pixels in the label correctly classified as positive pixels. True negative (TN) means the negative pixels in the label correctly classified as negative pixels. False positive (FP) represents the negative pixels in the label that are incorrectly classified as positive pixels. False negative (FN) means the positive pixels in the label that are incorrectly classified as negative pixels. TP, TN, FP, and FN are used to calculate the evaluation metrics.
IoU is a widely used metric in semantic segmentation. It is the intersection of label and prediction over their union and thus indicates the pixel-level effectiveness of a model through the overlap of label and prediction. mIoU is the average of the IoU over every class i. OA shows the overall prediction accuracy. Recall rate indicates the proportion of positive pixels in the label that are identified, while precision rate indicates the proportion of predicted positive pixels that are correct. F1 score balances the recall rate and precision rate to represent the overall performance and avoid bias due to sample imbalance. These metrics can be calculated as follows:
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{16}$$
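As a small worked example, the metrics can be computed from binary masks as follows. The function name `binary_metrics` is an assumption, mIoU is taken here as the mean over the two classes (abandoned cropland and background), and no guard against empty classes is included.

```python
import numpy as np

def binary_metrics(pred, label):
    """OA, IoU, mIoU, recall, precision, and F1 for binary masks.

    Assumption: both classes occur, so no denominator is zero.
    """
    pred = np.asarray(pred, dtype=bool)
    label = np.asarray(label, dtype=bool)
    tp = float(np.sum(pred & label))
    tn = float(np.sum(~pred & ~label))
    fp = float(np.sum(pred & ~label))
    fn = float(np.sum(~pred & label))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou_pos = tp / (tp + fp + fn)          # IoU of the positive class
    iou_neg = tn / (tn + fp + fn)          # IoU of the background class
    return {
        "OA": (tp + tn) / (tp + tn + fp + fn),
        "IoU": iou_pos,
        "mIoU": (iou_pos + iou_neg) / 2,
        "Recall": recall,
        "Precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
    }

metrics = binary_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```

In this toy case TP = TN = FP = FN = 1, so OA, precision, recall, and F1 all equal 0.5, while IoU and mIoU equal 1/3.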

C. Comparisons With Other Methods
To better demonstrate the effectiveness of PACnet, we select the following mainstream segmentation networks for comparison on the VACD dataset and quantify the evaluation results.
Unet++: Unet++ [75] is a semantic segmentation network developed from U-Net. It nests encoder and decoder subnetworks into U-Net and redesigns its skip-connection module. By adding the deep supervision mechanism, Unet++ achieves faster model convergence.
DeepLabV3+: DeepLabV3+ [41] is one of the DL networks in the DeepLab series. DeepLabV3+ uses a typical encoder-decoder network structure. The encoder can extract multiple-resolution features, and by introducing ASPP, DeepLabV3+ expands the receptive field and enhances the representative capacity.
Pyramid scene parsing network (PSPnet): The critical module of the PSPnet [76] is the pyramid pooling module. It can combine contextual information from diverse regions and obtain a more potent global representative capacity. To some extent, PSPnet solves the problems of mismatched pixel context, confused semantic labels, and difficulty in small class prediction.
OCRnet: OCRnet [77] is often paired with HRnet as a backbone to obtain high-quality context importance and maintain high-resolution features. OCRnet implements a coarse-to-fine strategy to get a pixel-enhanced object-contextual representation.
Segformer: Segformer [78] is a simple and efficient semantic segmentation network with a transformer framework. Segformer extracts multiscale features by using a hierarchically structured transformer encoder.
It can be seen from Table I that our proposed PACnet achieves the highest OA, mIoU, precision rate, and F1 score of 93.8%, 71.7%, 70.7%, and 68.3%, respectively. Segformer obtains the highest recall rate of 72.2%, which is 6.2% higher than that of PACnet, due to its transformer framework. Although the recall rate of PACnet is lower than that of some other competitive models, its precision rate is much higher owing to a specific inhibitory effect on noisy labels, which will be discussed and analyzed in Section V. Among all these models, PACnet obtains the best result when all accuracy metrics are considered comprehensively.
From Fig. 7, we could find that the extraction result of our proposed PACnet is smooth and precise.From the images in Row 4, it is evident that other models cannot extract the target abandoned cropland completely but PACnet does.Besides, the prediction of PACnet is more in line with the actual surface of the label in Fig. 7(b) because of its noise resistance.In conclusion, our proposed PACnet with its promising ability to capture texture information and hidden key features makes fewer mistakes than other models and can extract the complete abandoned cropland in a more complex scene.
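The relationship between the metrics reported in Table I can be illustrated with a minimal sketch that derives OA, mIoU, precision, recall, and F1 from pixel counts for a two-class task (abandoned cropland as the positive class). The function names and the example counts are hypothetical, and the averaging conventions are assumed rather than taken from the paper.

```python
def binary_segmentation_metrics(tp, fp, fn, tn):
    """Compute common segmentation metrics from pixel counts.

    tp, fp, fn, tn are true/false positive/negative pixel counts, with
    abandoned cropland treated as the positive class (an assumption).
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou_pos = tp / (tp + fp + fn)   # IoU of the abandoned-cropland class
    iou_neg = tn / (tn + fn + fp)   # IoU of the background class
    return {
        "OA": (tp + tn) / total,
        "mIoU": (iou_pos + iou_neg) / 2,
        "precision": precision,
        "recall": recall,
        "F1": f1_score(precision, recall),
    }


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical pixel counts, for illustration only.
example = binary_segmentation_metrics(tp=30, fp=10, fn=20, tn=940)

# Consistency check on the reported figures, assuming the 6.2% recall gap
# is in percentage points (72.2% - 6.2% = 66.0% recall for PACnet).
pacnet_f1 = f1_score(0.707, 0.660)
```

Under that assumption, a precision of 70.7% and a recall of 66.0% give an F1 score of roughly 68.3%, which agrees with the value reported for PACnet.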

D. Ablation Experiments
To assess the contributions of CPCL and CCAM, we conduct ablation experiments on the VACD dataset and quantify the evaluation results. First, we carry out the baseline experiment with the initial network, which uses ResNet50 as the encoder and DeepLabV3 as the decoder. Then, we add CCAM to the baseline model to better extract intraimage features from different directions. Similarly, we introduce CPCL to the baseline to enhance interimage feature extraction. Finally, we run the baseline with both CPCL and CCAM.
Table II presents the metric results of our ablation experiments. The "Base" model is the baseline model without any additional modules. "+CCAM" represents the base model with CCAM, and "+CPCL" represents the base model with CPCL. Table II shows that adding CCAM to the baseline improves OA by 0.2% and the precision rate by 3.5%, which indicates that CCAM pays more attention to contextual importance and local feature representation along two orientations within images, lowering the likelihood of misclassification. Moreover, adding CPCL improves the performance of the baseline remarkably in all metrics. PACnet, which uses CCAM and CPCL together, obtains a further improvement over the baseline with only a single module.
Fig. 8 further demonstrates the function of CCAM and CPCL. Fig. 8(c) and (d) show that the baseline mistakenly classifies some background as abandoned cropland, whereas the baseline with CCAM does not. Comparing Fig. 8(d) and (e) shows that the model with CPCL makes fewer mistakes and tends to extract the abandoned cropland more completely. The prediction results in Fig. 8(f) are significantly closer to the accurate label, which demonstrates the effectiveness of our proposed CCAM and CPCL. We believe that using CCAM and CPCL simultaneously enhances PACnet's ability to extract comprehensive texture information and essential deep features, leading to improved performance.
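The "two orientations" exploited by CCAM can be made concrete with a small sketch of the criss-cross path: for a given pixel, attention is restricted to the positions in the same row and the same column. This follows the general criss-cross attention idea and is not the authors' implementation; the function name and the example feature-map size are hypothetical.

```python
def criss_cross_positions(i, j, height, width):
    """Return the (row, col) positions on the criss-cross path of pixel (i, j).

    The path is the union of row i and column j, so it contains
    height + width - 1 positions (the pixel itself is counted once).
    """
    same_row = [(i, c) for c in range(width)]
    # Skip row i here because (i, j) is already included in same_row.
    same_col = [(r, j) for r in range(height) if r != i]
    return same_row + same_col


# Hypothetical 4 x 5 feature map, query pixel at row 1, column 2.
path = criss_cross_positions(1, 2, height=4, width=5)
```

For this 4 x 5 example the path contains 8 positions, compared with the 20 positions that full self-attention would compare against, which is why restricting attention to rows and columns is cheaper.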

A. Noise Suppression Ability of PACnet
Because manual annotation cannot guarantee absolute accuracy, the labels used for training often contain some noise. Some studies have shown that contrastive learning is robust to noisy labels [79], [80]. The experimental results indicate that the CPCL used in this article has a specific inhibitory effect on noisy labels, because CPCL smooths out erroneous information in the segmentation loss by continuously comparing the similarities and differences between pixels. As shown in Table I, the proposed method has a relatively low recall rate but the highest precision rate. This suggests that CPCL identifies the more typical abandoned cropland pixels, implying a certain level of resistance to incorrect labels. As shown in Fig. 9, for labels that were not accurately annotated during training, the predicted results can be closer to the actual surface textures. These findings further demonstrate the inhibitory effect of CPCL on noisy labels, which alleviates the considerable cost of fine labeling and is worth further exploration.

B. Pros and Cons of PACnet
As analyzed in Section IV, PACnet can accurately extract fine-grained abandoned cropland from single-time-phase VHR data. Our qualitative and quantitative assessments substantiate that PACnet outperforms mainstream segmentation networks. This success can be attributed to the incorporation of CPCL and CCAM. PACnet distinguishes itself by emphasizing the capture of intra- and interimage contrastive features, a critical aspect often overlooked by classical models. This differentiation is particularly significant given the inherent complexity of directly modeling amorphous abandoned cropland. The integration of CPCL into PACnet enhances its proficiency in discerning the different characteristics of abandoned cropland and other land types at both the pixel and semantic region levels. Furthermore, introducing CCAM enriches the network's representative capacity, contributing to its superior performance.
Nevertheless, it is important to acknowledge a limitation of PACnet. Our sampled experimental area covers an expanse comparable to that of a province. This coverage confirms PACnet's proficiency in mapping fine-grained abandoned cropland across southern China. However, its efficacy might not be consistent when it is applied to regions with distinct topography and cropland attributes. Addressing this potential deficiency is a crucial direction for our future work, in which we intend to prioritize improving PACnet's transfer learning capabilities and to further mine the abundant information in time-series data.

VI. CONCLUSION
In this article, to address the problems of farmland fragmentation and the amorphous characteristics of abandoned cropland in southern China, we proposed a new fine-grained abandoned cropland mapping method (PACnet) based on pixel-level contrast learning. By integrating CPCL and CCAM, our proposal enhances the contrasting characteristics between abandoned land and other land features both across and within images. The experimental results show that PACnet has the highest accuracy (OA = 93.8% and mIOU = 71.7%) in mapping abandoned cropland compared with classical DL algorithms. We also find that CPCL has a specific inhibitory effect on, and robustness to, inaccurate labels. Our proposed method provides an important reference for VHR abandoned cropland mapping and analysis research. In the future, we will continue to explore the synergistic use of time-series features and VHR images to map abandoned cropland more accurately.

Fig. 1. Abandoned cropland in VHR images. Compared with the surrounding neat landscape, abandoned cropland appears amorphous and disorderly.
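$$
\mathcal{L}^{\mathrm{NCE}}_{p} = \frac{1}{|\mathcal{P}_{p}|} \sum_{\mathbf{e}^{+} \in \mathcal{P}_{p}} -\log \frac{\exp(\mathbf{p} \cdot \mathbf{e}^{+}/\tau)}{\exp(\mathbf{p} \cdot \mathbf{e}^{+}/\tau) + \sum_{\mathbf{e}^{-} \in \mathcal{N}_{p}} \exp(\mathbf{p} \cdot \mathbf{e}^{-}/\tau)} \tag{3}
$$
where $\mathcal{P}_{p}$ and $\mathcal{N}_{p}$ denote the positive and negative sample collections stored in the memory bank $\mathcal{B}$ for pixel $p$, $\mathbf{e}^{+}$ and $\mathbf{e}^{-}$ are the embeddings of positives and negatives, respectively, and $\mathbf{p}$ represents the pixel embedding of the query pixel $p$.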
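The pixel-wise contrastive loss in (3) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the loss is assumed to be averaged over the positives of the query pixel, and the default temperature `tau=0.1` is an assumed value that the text does not specify.

```python
import math


def dot(u, v):
    """Dot product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def pixel_nce_loss(query, positives, negatives, tau=0.1):
    """InfoNCE-style loss for one query pixel embedding.

    query:     embedding of the query pixel
    positives: embeddings of same-class pixels (e+), e.g. from a memory bank
    negatives: embeddings of other-class pixels (e-)
    tau:       temperature; 0.1 is an assumed default, not from the paper
    """
    # The sum over negatives is shared by every positive term.
    neg_sum = sum(math.exp(dot(query, e_neg) / tau) for e_neg in negatives)
    losses = []
    for e_pos in positives:
        pos = math.exp(dot(query, e_pos) / tau)
        losses.append(-math.log(pos / (pos + neg_sum)))
    # Assumption: average the per-positive terms.
    return sum(losses) / len(losses)


# Toy 2-D unit embeddings, for illustration only.
aligned = pixel_nce_loss([1.0, 0.0], positives=[[1.0, 0.0]], negatives=[[0.0, 1.0]])
confused = pixel_nce_loss([1.0, 0.0], positives=[[0.0, 1.0]], negatives=[[1.0, 0.0]])
```

In this toy example the loss is small when the query is close to its positive and far from its negative (`aligned`) and large in the opposite case (`confused`), which is the behavior that pulls same-class pixel embeddings together across images.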

The weight λ of the contrastive loss ℒ^NCE is 1. The learning rate decay strategy is LambdaLR with a step size of 100 and a gamma of 0.5. Random horizontal flipping and brightness adjustment are used as data augmentation to improve the model's generalization. Each image is flipped with a probability of 0.5. All training images are brightened with a shift value of 10. No augmentation is applied to the validation and test datasets. Our models and experiments are implemented with the open-source DL framework PyTorch. We train the model with the DistributedDataParallel strategy. The experimental environment is CentOS 7.5.1804. The GPU is a GeForce RTX 2080 Ti. The CPU is an Intel(R) Xeon(R) CPU E5-2680.
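The training settings above can be summarized in a framework-free sketch. It is written under several assumptions that the text does not confirm: the step size is counted in scheduler steps, the decay is a staircase multiplier of the kind that could be passed to PyTorch's `LambdaLR` as its `lr_lambda` argument, images are 8-bit so brightened values are clipped to 255, and the total loss is the segmentation loss plus λ times the contrastive loss. All function names are hypothetical.

```python
import random


def lr_factor(step, step_size=100, gamma=0.5):
    """Multiplicative learning-rate factor: halve the rate every 100 steps."""
    return gamma ** (step // step_size)


def augment(image, rng, flip_prob=0.5, shift=10):
    """Training-time augmentation on an image given as rows of 0..255 integers.

    Flips horizontally with probability flip_prob, then brightens every
    pixel by `shift`, clipping at 255 (8-bit range is an assumption).
    """
    if rng.random() < flip_prob:
        image = [row[::-1] for row in image]
    return [[min(255, px + shift) for px in row] for row in image]


def total_loss(seg_loss, nce_loss, lam=1.0):
    """Segmentation loss plus the weighted contrastive loss (lambda = 1)."""
    return seg_loss + lam * nce_loss


rng = random.Random(0)                       # fixed seed for repeatability
tiny_image = [[0, 100, 250]]                 # hypothetical 1 x 3 image
brightened = augment(tiny_image, rng, flip_prob=0.0)
flipped = augment(tiny_image, rng, flip_prob=1.0)
```

Validation and test images would bypass `augment` entirely, in line with the statement that no augmentation is applied to those splits.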