RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and Grounded Tasks


Abstract:

Recently, multimodal large language models (MLLMs) have shown excellent reasoning capabilities across various fields. Most existing remote sensing (RS) MLLMs address image-level text generation tasks (e.g., image captioning) but overlook the core RS problems of object-level recognition, localization, and multitemporal change analysis. In this article, we propose RingMoGPT, a multimodal foundation model that unifies vision, language, and localization. Building on the idea of domain adaptation, RingMoGPT is trained by fine-tuning only a small number of parameters. To equip the model for object detection and change captioning, we further propose a location- and instruction-aware querying transformer (Q-Former) and a change detection module, respectively. To improve the performance of RingMoGPT, we carefully design both the pretraining dataset and the instruction-tuning dataset. The pretraining dataset contains over half a million high-quality image-text pairs, generated through a low-cost, efficient data generation paradigm. The instruction-tuning dataset contains more than 1.6 million question-answer pairs covering six downstream tasks: scene classification, object detection, visual question answering (VQA), image captioning, grounded image captioning, and change captioning. Our experiments show that RingMoGPT performs well across all six tasks, particularly in analyzing multitemporal changes and identifying dense objects. We also evaluate the model in a zero-shot setting, and the results show that RingMoGPT generalizes well to unseen data.
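
The abstract does not detail the internals of the location- and instruction-aware Q-Former. As a rough illustration only, the sketch below shows one plausible structure under common assumptions from Q-Former-style designs: a small set of learnable queries cross-attends jointly to frozen image features and embedded instruction tokens, and a lightweight head regresses a normalized bounding box per query. All names here (LocationInstructionQFormer, box_head, num_queries) are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn

class LocationInstructionQFormer(nn.Module):
    """Illustrative sketch (not the paper's implementation).

    Learnable queries attend to frozen vision-encoder features together
    with instruction-token embeddings, so each query can specialize per
    task; a small head maps each query to a normalized (cx, cy, w, h) box.
    """

    def __init__(self, dim=768, num_queries=32, num_heads=8, num_layers=2):
        super().__init__()
        # Learnable query embeddings, shared across images.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Hypothetical box head: per-query (cx, cy, w, h) in [0, 1].
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, image_feats, instruction_embeds):
        # image_feats: (B, N_patches, dim) from a frozen vision encoder
        # instruction_embeds: (B, N_text, dim) embedded instruction tokens
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Queries cross-attend to visual and instruction context jointly.
        memory = torch.cat([image_feats, instruction_embeds], dim=1)
        out = self.decoder(q, memory)   # (B, num_queries, dim)
        boxes = self.box_head(out)      # (B, num_queries, 4)
        return out, boxes

if __name__ == "__main__":
    model = LocationInstructionQFormer()
    img = torch.randn(2, 196, 768)      # e.g., 14x14 patch features
    txt = torch.randn(2, 16, 768)       # embedded instruction tokens
    feats, boxes = model(img, txt)
    print(feats.shape, boxes.shape)     # (2, 32, 768) (2, 32, 4)
```

In such a design, the query outputs can be passed to the language model as visual tokens (enabling grounded captioning), while the box head supplies the object locations; this is one common way to couple detection with text generation, though the paper's actual mechanism may differ.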
Article Sequence Number: 5611320
Date of Publication: 04 December 2024
