
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing



Abstract:

General-purpose foundation models have led to recent breakthroughs in artificial intelligence (AI). In remote sensing, self-supervised learning (SSL) and masked image modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable to retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing, which aims to learn robust visual features with rich semantics, together with aligned text embeddings, for seamless downstream application. To address the scarcity of pretraining data, we leverage data scaling, which converts heterogeneous annotations into a unified image-caption format based on box-to-caption (B2C) and mask-to-box (M2B) conversion. By further incorporating unmanned aerial vehicle (UAV) imagery, we produce a pretraining dataset 12× larger than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, k-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark for testing object-counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art (SOTA) method by 9.14% mean recall on the RSITMD dataset and by 8.92% on the RSICD dataset. For zero-shot classification, RemoteCLIP outperforms the contrastive language-image pretraining (CLIP) baseline by up to 6.39% average accuracy on 12 downstream datasets.
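To give a concrete picture of the box-to-caption (B2C) idea mentioned above, the following is a minimal sketch of how object-detection annotations could be turned into captions for image-text pretraining. The function name, caption templates, and counting rules are illustrative assumptions, not the exact procedure used by RemoteCLIP.

```python
# Hedged sketch of the box-to-caption (B2C) idea: turning detection boxes
# into simple captions. Templates and rules here are illustrative only.
from collections import Counter
from typing import List, Tuple

def boxes_to_caption(boxes: List[Tuple[str, float, float, float, float]]) -> str:
    """boxes: list of (class_name, x1, y1, x2, y2) detection annotations."""
    if not boxes:
        return "An aerial image with no annotated objects."
    counts = Counter(name for name, *_ in boxes)
    parts = []
    for name, n in counts.items():
        noun = name if n == 1 else name + "s"  # naive pluralization
        parts.append(f"{n} {noun}")
    return "An aerial image containing " + ", ".join(parts) + "."

# Example with hypothetical detection annotations:
print(boxes_to_caption([("airplane", 10, 10, 50, 50),
                        ("airplane", 60, 10, 100, 50),
                        ("vehicle", 5, 80, 20, 95)]))
# -> "An aerial image containing 2 airplanes, 1 vehicle."
```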
Article Sequence Number: 5622216
Date of Publication: 18 April 2024



I. Introduction

Foundation models [1] are becoming increasingly important in the field of artificial intelligence (AI). Compared to small, specialized models tailored for specific tasks or domains, “one-for-all”-style general-purpose foundation models typically exhibit superior capabilities and generalization across a wide range of downstream tasks. Numerous foundation models have emerged in recent years, such as SimCLR [2], masked autoencoder (MAE) [3], and segment anything model (SAM) [4] for computer vision, Bidirectional Encoder Representations from Transformers (BERT) [5] and the generative pre-trained transformer (GPT) series [6], [7] for natural language processing, and contrastive language-image pretraining (CLIP) [8] and Flamingo [9] for vision-language learning.
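Vision-language models such as CLIP align image and text embeddings in a shared space, which is what allows a single pretrained model to perform zero-shot classification and retrieval without task-specific fine-tuning. The snippet below is a minimal sketch of CLIP-style zero-shot classification for remote-sensing scenes using the open_clip library; the backbone name, checkpoint, image path, and class prompts are placeholders rather than RemoteCLIP's released assets.

```python
# Minimal sketch of CLIP-style zero-shot scene classification.
# Assumes `pip install open_clip_torch torch pillow`; backbone/checkpoint,
# image path, and class list are placeholders, not RemoteCLIP's released weights.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Candidate scene categories phrased as natural-language prompts.
classes = ["airport", "beach", "farmland", "forest", "residential area"]
prompts = [f"a satellite image of a {c}" for c in classes]

image = preprocess(Image.open("scene.jpg")).unsqueeze(0)  # placeholder image path
text = tokenizer(prompts)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    # Cosine similarity between the image embedding and each prompt embedding,
    # turned into a probability distribution over classes.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

for c, p in zip(classes, probs[0].tolist()):
    print(f"{c}: {p:.3f}")
```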

