I. Introduction
Remote sensing (RS) is the science and technology of acquiring and recording information about physical objects, areas, or phenomena from a distance, typically from aircraft or satellites [1]. In recent years, the rapid development of RS applications in agriculture, forestry, environmental monitoring, and urban planning has led to explosive growth in RS image data [2]. However, in contrast to this strong acquisition capability, RS data processing capacity lags far behind, giving rise to the paradox of “big data, little knowledge” [3], [4]. To extract valuable knowledge from the enormous volume of RS data conveniently and flexibly, effective automatic cross-modal retrieval methods that exploit information from different modalities are increasingly needed [5]. Owing to its fast extraction and filtering of image information and its flexible human–computer interaction, RS cross-modal text-image retrieval (RSCTIR) [6], [7] makes it possible to explore this growing body of cross-modal RS data and has become an emerging research area at the intersection of natural language processing (NLP) and computer vision (CV).