Abstract:
Recently, multimodal large language models (MLLMs) have shown excellent reasoning capabilities in various fields. Most existing remote sensing (RS) MLLMs address image-level text generation tasks (e.g., image captioning) but overlook the core issues of object-level recognition, localization, and multitemporal change analysis in the RS field. In this article, we propose RingMoGPT, a multimodal foundation model that unifies vision, language, and localization. Based on the idea of domain adaptation, RingMoGPT can be trained by fine-tuning only a small number of parameters. To equip the model for object detection and change captioning, we further propose a location- and instruction-aware querying transformer (Q-Former) and a change detection module, respectively. To improve the performance of RingMoGPT, we carefully design the pretraining dataset and the instruction-tuning dataset. The pretraining dataset contains over half a million high-quality image-text pairs, generated through a low-cost and efficient data generation paradigm. The instruction-tuning dataset contains more than 1.6 million question-answer pairs, covering six downstream tasks: scene classification, object detection, visual question answering (VQA), image captioning, grounded image captioning, and change captioning. Our experiments show that RingMoGPT performs well on all six tasks, particularly in analyzing multitemporal changes and identifying dense objects. We also evaluate the model under a zero-shot setting, and the results show that RingMoGPT generalizes well to previously unseen data.
Published in: IEEE Transactions on Geoscience and Remote Sensing (Volume: 63)