RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and Grounded Tasks


Abstract:

Recently, multimodal large language models (MLLMs) have shown excellent reasoning capabilities across various fields. Most existing remote sensing (RS) MLLMs address image-level text generation tasks (e.g., image captioning) but overlook the core RS problems of object-level recognition, localization, and multitemporal change analysis. In this article, we propose RingMoGPT, a multimodal foundation model that unifies vision, language, and localization. Building on the idea of domain adaptation, RingMoGPT is trained by fine-tuning only a small number of parameters. To equip the model for object detection and change captioning, we further propose a location- and instruction-aware querying transformer (Q-Former) and a change detection module, respectively. To improve the performance of RingMoGPT, we carefully design both the pretraining dataset and the instruction-tuning dataset. The pretraining dataset contains over half a million high-quality image-text pairs, generated through a low-cost, efficient data generation paradigm. The instruction-tuning dataset contains more than 1.6 million question-answer pairs covering six downstream tasks: scene classification, object detection, visual question answering (VQA), image captioning, grounded image captioning, and change captioning. Our experiments show that RingMoGPT performs well across all six tasks, particularly in analyzing multitemporal changes and identifying dense objects. We also evaluate the model in a zero-shot setting, and the results show that RingMoGPT generalizes well to unseen data.
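
The abstract does not detail the internals of the location- and instruction-aware Q-Former. As a rough illustration only, the sketch below shows one plausible structure under common assumptions from Q-Former-style designs: a small set of learnable queries cross-attends jointly to frozen image features and embedded instruction tokens, and a lightweight head regresses a normalized bounding box per query. All names here (LocationInstructionQFormer, box_head, num_queries) are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn

class LocationInstructionQFormer(nn.Module):
    """Illustrative sketch (not the paper's implementation).

    Learnable queries attend to frozen vision-encoder features together
    with instruction-token embeddings, so each query can specialize per
    task; a small head maps each query to a normalized (cx, cy, w, h) box.
    """

    def __init__(self, dim=768, num_queries=32, num_heads=8, num_layers=2):
        super().__init__()
        # Learnable query embeddings, shared across images.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Hypothetical box head: per-query (cx, cy, w, h) in [0, 1].
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, image_feats, instruction_embeds):
        # image_feats: (B, N_patches, dim) from a frozen vision encoder
        # instruction_embeds: (B, N_text, dim) embedded instruction tokens
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Queries cross-attend to visual and instruction context jointly.
        memory = torch.cat([image_feats, instruction_embeds], dim=1)
        out = self.decoder(q, memory)   # (B, num_queries, dim)
        boxes = self.box_head(out)      # (B, num_queries, 4)
        return out, boxes

if __name__ == "__main__":
    model = LocationInstructionQFormer()
    img = torch.randn(2, 196, 768)      # e.g., 14x14 patch features
    txt = torch.randn(2, 16, 768)       # embedded instruction tokens
    feats, boxes = model(img, txt)
    print(feats.shape, boxes.shape)     # (2, 32, 768) (2, 32, 4)
```

In such a design, the query outputs can be passed to the language model as visual tokens (enabling grounded captioning), while the box head supplies the object locations; this is one common way to couple detection with text generation, though the paper's actual mechanism may differ.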
Article Sequence Number: 5611320
Date of Publication: 04 December 2024
