Hypergraph-Enhanced Textual-Visual Matching Network for Cross-Modal Remote Sensing Image Retrieval via Dynamic Hypergraph Learning

Cross-modal remote sensing (RS) image retrieval aims to retrieve RS images using other modalities (e.g., text) and vice versa. The relationships between objects in an RS image are complex, i.e., the distribution of multiple types of objects is uneven, which makes matching with the query text inaccurate and thus restricts retrieval performance. Previous methods generally focus on feature matching between the RS image and text and rarely model the relationships between features of the RS image. A hypergraph (a hyperedge connecting multiple vertices) is an extended structure of a regular graph and has attracted extensive attention for its superiority in representing high-order relationships. Inspired by these advantages, in this work, a hypergraph-enhanced textual-visual matching network (HyperMatch) is proposed to circumvent inaccurate matching between the RS image and query text. Specifically, a multiscale RS image hypergraph network is designed to model the complex relationships between features of the RS image, grouping the valuable and redundant features into different hyperedges. In addition, a hypergraph construction and update method for RS images is designed. To construct a hypergraph, the features of an RS image serve as vertices, and cosine similarity is the metric to measure the correlation between them. Vertex and hyperedge attention mechanisms are introduced for the dynamic update of the hypergraph, realizing the alternating update of vertices and hyperedges. Quantitative and qualitative experiments on the RSICD and RSITMD datasets verify the effectiveness of the proposed method in cross-modal remote sensing image retrieval.


I. INTRODUCTION
With the development of remote sensing (RS) information acquisition technology, the number of remote sensing images has increased exponentially. The collected remote sensing images contain diverse scenes and different types of objects. In addition, the resolution of remote sensing images varies significantly due to the different standards of sensors. Therefore, managing massive remote sensing images is complex, and updating remote sensing retrieval technology is urgent. Cross-modal remote sensing image retrieval aims to use other modalities, such as text, as queries to retrieve remote sensing images. With its flexible query form, it has become a research hotspot in the field in recent years.
Previous methods generally generate textual descriptions and then retrieve remote sensing images by measuring the matching degrees between query text and the textual descriptions [1], [2], [3]. These methods are essentially text-to-text retrieval, which ignores the direct matching between remote sensing image and query text and are susceptible to the quality of generated textual description. To avoid the disadvantages of two-stage retrieval, Yuan et al. [4] propose an end-to-end retrieval method to directly learn the matching degree between query text and remote sensing image.
Although the above methods have promoted the development of cross-modal remote sensing image retrieval and aroused widespread concern in the industry, they still face the following three challenges.
1) As shown in Fig. 1, many objects appear in a remote sensing image, including planes, cars, buildings, and others. In addition, the distribution of similar objects is uneven, and the scale of objects is inconsistent, i.e., different objects occupy various numbers of pixels. How to reasonably model the relationships between complex objects in remote sensing images and deal with their multiscale nature is the first challenge.

2) The relationships between words in the query text also need to be quantified. Different terms contribute differently to other words. The second challenge is how to accurately quantify the contribution relationships between words in the query text.

3) The third challenge is measuring the correlation between the query text and the remote sensing image and making the entities in the query text accurately match the objects in the remote sensing image.

In recent years, graph neural networks have developed rapidly for modeling the relationships between vertices. Inspired by graph neural networks, we use an undirected fully connected graph structure to model the relationships between words in the query text. Words are regarded as vertices, and edges quantify the contributions of words to each other.
The relationships between data are not only simple pairwise relationships but also more complex relationships among multiple vertices. Unlike the ordinary graph structure, in which one edge can only connect two vertices, a hyperedge in a hypergraph can connect any number of vertices, which makes the hypergraph naturally suitable for modeling multivertex relationships. Inspired by this superiority, we use a hypergraph to model the complex relationships between objects in a remote sensing image and use the same hyperedge to connect objects belonging to the same category, as shown in Fig. 1(c). To solve the multiscale problem of remote sensing images, we design high-level and low-level RS image hypergraph networks to learn the correlations between multiple objects at different scales. Specifically, for the high-level RS image hypergraph network, the high-level RS image features are used as the vertices of the hypergraph, and related RS image features form the hyperedges. The vertices and hyperedges of the low-level RS image hypergraph network are defined similarly.
In this article, we introduce dynamic hypergraph learning into cross-modal remote sensing image retrieval. A hypergraph-enhanced textual-visual matching network (HyperMatch) is proposed to circumvent the problem of inaccurate matching between the RS image and query text. To model the relationships between multiple objects in an RS image at different scales, a high-level hypergraph network and a low-level hypergraph network are designed, respectively. For the construction of a hypergraph, cosine similarity is employed to measure the correlation between objects. A hypergraph attention mechanism is elaborated for the dynamic alternating update of vertices and hyperedges during hypergraph evolution. In addition, an undirected fully connected graph network is applied to quantify the mutual contribution of words in the query text. Furthermore, the multiscale feature fusion and the image-guided multimodal fusion are designed to fuse the RS image features at different scales and extract the valuable text features for accurate matching with the RS image, respectively.
In summary, the contributions are as follows.

1) This article introduces hypergraph learning into cross-modal RS image retrieval and correspondingly proposes HyperMatch to avoid inaccurate matching between the RS image and query text.

2) Aiming at the issues of multiple types, uneven distribution, and multiscale objects in RS images, the high-level and low-level RS image hypergraph networks are designed to model the relationships between objects at different scales, respectively, and to cluster similar object features into a hyperedge. Besides, an undirected fully connected graph network is conceived to quantify the contribution of words to each other in the query text.

3) A dynamic hypergraph learning algorithm for RS images is proposed to measure the correlation between objects and realize the alternating update of vertices and hyperedges.
4) Quantitative and qualitative experiments on the published RSICD and RSITMD datasets verify the effectiveness of the proposed method in cross-modal remote sensing image retrieval.

II. RELATED WORK
In this section, we mainly review the previous work that is most relevant to our proposed method, including cross-modal remote sensing image retrieval, text-image matching in natural scenes, and hypergraph learning.

A. Cross-Modal Remote Sensing Image Retrieval
To address the modality discrepancy caused by imaging mechanisms of synthetic aperture radar (SAR) and optical images, Xiong et al. [5] propose a cross-modality hashing network to extract the contour and texture shared features from across modalities. A CNN-RNN framework accompanied by beam search is exploited in [6] to generate multiple captions for retrieving RS images. Mao et al. [7] design a deep visual-audio network to directly capture the correspondence of image and audio for speech-to-image retrieval. Demir et al. [8] introduce hashing-based approximate nearest neighbor search to project high-dimensional image feature vectors into compact binary hash codes for content-based image retrieval. Hang et al. [9] propose an unsupervised feature learning model using multimodal data, hyperspectral, and light detection and ranging (LiDAR). A multiscale progressive segmentation network is proposed in [10] to address the issue of simultaneously segmenting objects with large-scale variations in high-resolution remote sensing imagery. Hang et al. [11] propose a spectral super-resolution network guided by the spectral correlation and the projection properties of hyperspectral imagery. To cope with cross-source RS image retrieval, Li et al. [12] introduce a source-invariant hashing convolutional neural network which can be optimized in an end-to-end manner. To reduce the memory and improve the retrieval efficiency, Chen et al. [13] propose an image-voice retrieval network to capture more information on RS data for generating hash codes with low memory. A cross-source distillation network with a well-designed joint optimization configuration is proposed in [14] to solve the data drift in cross-source content-based RS image retrieval (CS-CBSIR). Lv et al. [15] explore an image translation-based framework to address the data drift in CS-CBSIR by mapping the source domain to the object domain and keeping the generated images' content similar to the original.
To reduce the occupancy and overhead of cross-modal RS image retrieval algorithm, Yuan et al. [16] come up with a concise but effective cross-modal retrieval method via contrast learning and knowledge distillation.

B. Textual-Visual Matching
Wang et al. [17] present a fusion layer-based approach to extract the relationship between cross-modal features and a straightforward gradient-updating method to reduce the computational complexity for textual-visual matching. Li et al. [18] devise an identity-aware two-stage deep learning framework to screen incorrect matchings and refine the matching results with a latent co-attention mechanism. Lee et al. [19] present stacked cross-attention to discover the latent alignments between image regions and words in the text for inferring image-text similarity. To learn modality-invariant feature representations, a text-image modality adversarial matching method incorporating adversarial learning is introduced in [20]. Liu et al. [21] propose a graph-structured matching network to construct graph structures for image and text and exploit graph convolution to propagate node correspondence for inferring fine-grained phrase correspondence. To learn the matching relations between image and text, Ma et al. [22] employ a convolutional architecture to encode the image and compose semantic words. Messina et al. [23] introduce a transformer-based relationship-aware network to map visual and textual modalities into a common abstract concept space by sharing the weights of self-attentive layers. To capture the interrelationship of cross-modalities, Nguyen et al. [24] introduce a local and global scene graph matching model to extract and learn insightful features of nodes and edges from image and text graphs. Gu et al. [25] incorporate image-to-text and text-to-image generative models into cross-modal feature embedding for learning high-level and local-grounded representations.

C. Hypergraph Learning
To uncover complex higher-order interactions in different applications, Zhang et al. [26] develop a new self-attention-based graph neural network for handling homogeneous and heterogeneous hypergraphs with variable hyperedge size. To adapt the hypergraph topology, Zhang et al. [27] devise a hypergraph Laplacian adaptor which adopts a self-attention mechanism to capture global information and a trainable distance matrix to empower the updating of the topology in an end-to-end manner. To explore the local structure of the data distribution, Ma et al. [28] present an approximation algorithm of hypergraph p-Laplacian regularization to preserve the geometry of the probability distribution. Duan et al. [29] present a local constraint-based sparse manifold hypergraph learning algorithm to discover the manifold-based light structure and the multivariate discriminant sparse relationship of hyperspectral images. Wei et al. [30] introduce an information-sharing mechanism to share the same structural distribution while preserving the specificity of each low-dimensional representation via adjusting the view-dependent hyperedge weights. To reduce the dimension of the hyperspectral image, Luo et al. [31] propose a sparse-adaptive hypergraph discriminant analysis method for adaptively revealing the intrinsic structure relationships with sparse representation.

III. PROBLEM
In this work, we focus on text-based cross-modal RS image retrieval. Therefore, establishing a text-image matching model is the primary problem to be solved. An RS image possesses the attributes of multiple scales and multiple objects, so how to reasonably model the relationships between complex objects and deal with the multiscale issue becomes the first challenge. In addition,

Fig. 2. Overview of HyperMatch. Aiming at the issue of multiple types, uneven distribution, and multiscale objects in an RS image, the multiscale RS image hypergraph networks are designed to model the relationship between objects at different scales by clustering the similar object features into a hyperedge. Besides, a textual fully connected graph network is conceived to quantify the contribution of words to each other in the query text. In addition, we develop a cross-modal matching module to grasp the coreference relationship and improve the retrieval accuracy.
the query text is composed of multiple words, and different terms have different contributions to others. How to accurately quantify the contributing relationships among words is the second challenge. Corresponding relationships exist between the objects in the RS image and the entities in the query text. The third challenge is ensuring that the objects match the related entities.
Main problem (Cross-modal matching): Given an RS image I and a query text T, the goal of cross-modal RS image retrieval is to build a model F to measure the matching degree S between them, i.e.,

S = F(I, T)

where S denotes the cross-modal similarity for measuring the matching degree.

Challenge 1 (Relationships between objects): For an RS image I, it owns multiple objects at different scales. Thus, it is necessary to devise a module not only to solve the multiscale nature of objects but also to cluster the objects that belong to the same categories, i.e.,

I_updated^high = F_1^high(I; ϑ_1^high),  I_updated^low = F_1^low(I; ϑ_1^low)

where I_updated^high is the RS image features that encode high-scale object information, F_1^high(·) represents the module that models the relationships between high-scale objects, and ϑ_1^high stands for the learnable parameters. The meanings of I_updated^low, F_1^low(·), and ϑ_1^low are analogous.

Challenge 2 (Relationships between entities):
The contribution of entities to each other in the query text is different, so we need to build a module to measure the contribution relationships, i.e.,

T_updated = F_2(T; ϑ_2)

where T_updated is the updated query text features, F_2(·) represents the module for learning the contribution relationships between entities, and ϑ_2 stands for the learnable parameters.

Challenge 3 (Objects matching entities): RS images retrieved through query text usually have a high degree of compatibility, which is mainly reflected in the correspondence between the objects of the RS image and the entities in the query text. Therefore, it is essential to construct a matching method to learn the correspondence relationships, as follows:

S = F_3(I_updated, T_updated)

where S denotes the cross-modal similarity for measuring the matching degree and F_3(·) refers to the matching method between objects in the RS image and entities in the query text.

IV. METHODOLOGY
To accurately retrieve the RS images according to the query text or find the appropriate descriptions through the RS image, we construct a hypergraph-enhanced textual-visual matching network named HyperMatch. As illustrated in Fig. 2, HyperMatch contains RS image feature extraction, query text feature extraction, the multiscale RS image hypergraph network, the textual fully connected graph network, and cross-modal matching. In the following, we introduce these components in detail.

A. Preliminaries
Hypergraph definition: A hypergraph with n vertices and m hyperedges can be defined as G = (V, E, W), where V = {V_1, V_2, ..., V_n} and E = {E_1, E_2, ..., E_m} represent the sets of vertices and hyperedges, respectively. W = diag(w_1, w_2, ..., w_m) is a diagonal matrix of hyperedge weights. The structure of a hypergraph can also be formulated by an incidence matrix H ∈ R^{n×m}, with H(i, j) = 1 if vertex V_i belongs to hyperedge E_j and H(i, j) = 0 otherwise.

Dynamic hypergraph learning: According to the characteristics of multiple objects and multiple categories in an RS image, a well-designed dynamic hypergraph learning algorithm is introduced to automatically model the association relationships between multiple objects and cluster congeneric objects into the same hyperedge. As demonstrated in Fig. 3, the algorithm consists of three processes, i.e., hypergraph construction, hyperedge update, and vertex update.
Hypergraph construction: Given the feature matrix M = {v_i}_{i=1}^n of an RS image, each element/vector v_i in the matrix is regarded as a vertex V_i. To reasonably connect the relevant features by a hyperedge, for each vertex, the cosine similarity cosine(v_i, v_j) is employed as the metric to cluster its k nearest vertices into a hyperedge E_i = {v_i} ∪ {v_m, ..., v_{m+k−1}}. In this way, a hypergraph with n vertices and n hyperedges is formed, the incidence matrix H ∈ R^{n×n} is square, and the hyperedge weights default to 1.
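The construction step above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: `build_incidence` is a hypothetical helper, and k is set smaller than the value of 6 used in the experiments so the toy example stays readable.

```python
import numpy as np

def build_incidence(M, k=3):
    """Cluster each vertex with its k nearest neighbours (by cosine
    similarity) into one hyperedge, yielding a square incidence
    matrix H (n vertices x n hyperedges)."""
    # Normalise rows so a dot product equals cosine similarity.
    X = M / np.linalg.norm(M, axis=1, keepdims=True)
    S = X @ X.T                       # pairwise cosine similarities
    n = M.shape[0]
    H = np.zeros((n, n))
    for i in range(n):
        # k most similar vertices, excluding vertex i itself
        order = np.argsort(-S[i])
        neighbours = [j for j in order if j != i][:k]
        H[[i] + neighbours, i] = 1.0  # hyperedge E_i
    return H

# Toy example: 5 random "object" features of dimension 8.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 8))
H = build_incidence(M, k=2)
```

Each column of H encodes one hyperedge, so every column contains exactly k + 1 ones: the seed vertex plus its k nearest neighbours.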
Hyperedge update: After the hypergraph construction, the hyperedges need to be updated by gathering their connected vertex information. Based on this, we conceive a hyperedge update mechanism. Owing to the specific structure of the hypergraph, each hyperedge is considered the intermediary of vertex feature updating. In other words, the update of a vertex needs to first aggregate the information of its connected hyperedges rather than directly update with adjacent vertices.
With n vertices {V_k}_{k=1}^n connected by a hyperedge E_j, hyperedge update aims to emphasize the significant vertices by calculating the contribution of the vertices to the hyperedge and then aggregates them to update the hyperedge feature e_j:

α_{jk} = exp(a_v^T σ(W_v v_k + b_v)) / Σ_{k'} exp(a_v^T σ(W_v v_{k'} + b_v))

e_j = σ(Σ_k α_{jk} W'_v v_k)

where σ is the nonlinear activation, W_v, W'_v, and a_v^T are weight parameters, and b_v denotes the learnable bias.
Vertex update: Contrary to the hyperedge update process, vertex update is devised for converging the hyperedge information to update the connected vertices. Given a set of hyperedges Y = {..., E_m, ..., E_n, ...} that are connected to a vertex V_k, the update process of the vertex feature v_k can be formalized as

β_{km} = exp(a_e^T σ(W_e e_m)) / Σ_{E_{m'} ∈ Y} exp(a_e^T σ(W_e e_{m'}))

v_k^l = σ(Σ_{E_m ∈ Y} β_{km} W'_e e_m)

where v_k^l refers to the updated feature of vertex V_k that gathers information from all of its connected hyperedges Y. W_e, W'_e, and a_e^T are weight parameters, and a_e^T measures the significance of the hyperedges to vertex V_k.
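The two attention-based update steps above can be sketched as follows. This is a hedged reconstruction: the paper uses separate weight matrices per step (W_v and W'_v, W_e and W'_e), while the sketch reuses a single matrix per step and fixes σ = tanh, both simplifying assumptions made for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hyperedge_update(H, V, W, a, b):
    """Aggregate vertex features into each hyperedge, weighting each
    connected vertex by its attention score (sketch of e_j)."""
    n, m = H.shape
    E = np.zeros((m, W.shape[0]))
    for j in range(m):
        idx = np.where(H[:, j] > 0)[0]                 # vertices in E_j
        scores = np.array([a @ np.tanh(W @ V[i] + b) for i in idx])
        alpha = softmax(scores)                        # vertex contributions
        E[j] = np.tanh((alpha[:, None] * (V[idx] @ W.T)).sum(0))
    return E

def vertex_update(H, E, W, a, b):
    """Symmetric step: each vertex gathers its connected hyperedges."""
    n, m = H.shape
    Vnew = np.zeros((n, W.shape[0]))
    for i in range(n):
        idx = np.where(H[i] > 0)[0]                    # hyperedges touching V_i
        scores = np.array([a @ np.tanh(W @ E[j] + b) for j in idx])
        beta = softmax(scores)
        Vnew[i] = np.tanh((beta[:, None] * (E[idx] @ W.T)).sum(0))
    return Vnew

rng = np.random.default_rng(1)
d = 4
V = rng.normal(size=(5, d))
H = np.eye(5)                  # trivial one-vertex hyperedges for the demo
W, a, b = rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)
E = hyperedge_update(H, V, W, a, b)
V2 = vertex_update(H, E, W, a, b)
```

Alternating the two calls realizes the vertex-hyperedge-vertex message flow described above; in practice H would come from the cosine-similarity construction rather than the identity used here.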

B. Feature Extraction
RS image feature extraction: As for RS images, we resize them to 256 × 256 pixels and randomly crop and rotate the images to extend the training samples. To avoid overfitting due to a deep backbone, we apply the ResNet-18 model [32] pretrained on the ImageNet dataset [33], [34], [35], following [36], to extract the last convolution layer's feature maps of size 512 × 8 × 8, i.e., v_g = ResNet(I), where v_g denotes the global feature of the RS image.
Although v_g contains the global information of the RS image, it still encounters a bottleneck in accurately expressing the multiscale properties of objects. To solve this issue, we follow [4] to up-sample the feature maps of the first three layers of ResNet-18 and concatenate these feature maps together as the low-level RS image features V_I^low. In addition, the feature maps of the last two layers are sampled and concatenated as the high-level RS image features V_I^high.

Query text feature extraction: A query text can be regarded as a word sequence {x_i}_{i=1}^n. Considering the temporal information in the query text, we first exploit a BiGRU [37] as the text encoder to refine each embedded word e(x_i) from the forward and backward directions. Afterward, the generated bidirectional hidden states are averaged to avoid dimension amplification, as follows:

h_t = (1/2)(→h_t + ←h_t)

where h_t refers to the hidden state of word x_t containing forward and reverse query text information. All the hidden states {h_t}_{t=1}^n compose the features of the query text V_T ∈ R^{n×d}.
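Assuming the usual layout in which a BiGRU concatenates forward and backward states along the feature axis, the direction-averaging step can be sketched with NumPy (the BiGRU itself is omitted; `average_bidirectional` is an illustrative helper, not the authors' code).

```python
import numpy as np

def average_bidirectional(hidden):
    """hidden: (n_words, 2*d) array, forward states in the first d
    dimensions and backward states in the last d (the common BiGRU
    output layout). Averaging the two directions keeps the per-word
    text feature at size d instead of 2*d."""
    n, two_d = hidden.shape
    d = two_d // 2
    return 0.5 * (hidden[:, :d] + hidden[:, d:])

rng = np.random.default_rng(2)
hidden = rng.normal(size=(7, 2 * 512))   # 7 words, BiGRU hidden size 512
V_T = average_bidirectional(hidden)      # query text features, 7 x 512
```

This keeps the text dimension equal to the image embedding dimension (512 in the experimental settings), which the later cross-modal interactions rely on.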

C. Multiscale RS Image Hypergraph Networks
To handle the multiscale properties of objects in RS images, we develop the multiscale RS image hypergraph networks. Specifically, based on the extracted high-level RS image features V high I containing high-level semantic information, a high-level RS image hypergraph network is established to capture the relationships between high-level objects, cluster similar high-level objects into a hyperedge through dynamic hypergraph learning, and promote the information interaction between the objects. Similarly, the low-level RS image hypergraph network is also constructed to determine the relationships between low-level objects.
1) High-Level RS Image Hypergraph Network: The high-level RS image features V_I^high are regarded as the vertices, and the hypergraph is constructed as

HG_I^high = HG(V_I^high; W_I^high)

where HG_I^high refers to the constructed high-level RS image hypergraph, W_I^high is the weight of the hyperedges (default to 1), and HG(·) is the method of constructing the hypergraph.
The constructed hypergraph models the relationships between high-level objects without making an object learn knowledge from other related objects. Thus, the dynamic hypergraph learning in Section IV-A is adopted for iterative hypergraph updating. Specifically, the hyperedge update mechanism HyperedgeUpdate(·) aggregates the relevant object features into their connected hyperedges according to the contribution of these objects to the hyperedges:

E_I^high = HyperedgeUpdate(HG_I^high, V_I^high)

where E_I^high is the updated hyperedge features. Since various hyperedges connect a vertex/object feature, the hyperedge that gathers relevant high-level object features is regarded as a relay station to feed the information back to the vertices:

V̂_I^high = VertexUpdate(HG_I^high, E_I^high)

where V̂_I^high is the updated vertex features that contain all high-level object information in the RS image. Finally, the above process is repeated multiple times to obtain sufficient information for high-level objects.
2) Low-Level RS Image Hypergraph Network: Similar to the process of the high-level RS image hypergraph network, first, the low-level RS image features V_I^low are considered as the vertices of the low-level RS image hypergraph network. Thereafter, the hypergraph construction method HG(·) is employed to cluster the relevant low-level RS objects into hyperedges E_I^low, and the incidence matrix H_I^low is calculated. Eventually, the hyperedge update HyperedgeUpdate(·) and vertex update VertexUpdate(·) mechanisms are repeatedly exploited to promote the iterative interaction of hyperedge and vertex features, achieving the fusion of the most relevant information of other objects for each low-level object. All formulation processes are as follows:

HG_I^low = HG(V_I^low; W_I^low)

E_I^low = HyperedgeUpdate(HG_I^low, V_I^low)

V̂_I^low = VertexUpdate(HG_I^low, E_I^low)

where HG_I^low, E_I^low, and V̂_I^low represent the constructed low-level RS image hypergraph, the updated hyperedges, and the updated vertices, respectively.

D. Textual Fully Connected Graph Network
The query text is composed of various words, and different terms are of varying importance to the retrieval task. For instance, some entities (such as "plane" in Fig. 2) play a decisive role in RS image retrieval. In addition, there are internal relationships between words, e.g., "gray" for modifying "plane." Therefore, to model the relationships between arbitrary pairwise terms and capture the contribution of other words to a word, we elaborate a textual fully connected graph network.
The text features V_T extracted by the BiGRU are utilized as the graph's vertices, and an edge connects each arbitrary pair of vertices to build the fully connected graph. The mutual contribution of words determines the weight of each edge. Given a fully connected graph with n vertices, the significance of an edge/contribution can be calculated as follows:

a_k = exp(v^T m_k) / Σ_{j=1}^n exp(v^T m_j)

where v ∈ V_T is a vertex in the fully connected graph, m_k ∈ V_T is a vertex sharing the same edge with v, and a_k represents the weight of the edge connecting v and m_k, which is used to measure the significance of m_k to v.
With the weight of each edge in the fully connected graph, each vertex/word aggregates information from the other vertices/words according to their importance:

v̂_{t_i} = Σ_k a_k m_k

where v̂_{t_i} is the ith vertex feature that converges all the word information in the query text. To make the words learn more fine-grained information from other words, we repeat the above process multiple times and obtain the query text feature matrix V̂_T ∈ R^{n×d}.
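One round of the word-graph aggregation can be sketched as below. The learnable projection W and the residual connection that preserves each word's own feature are illustrative assumptions; the exact parameterization of the edge weights is not specified above.

```python
import numpy as np

def text_graph_update(V_T, W):
    """One round of fully connected graph attention over word
    features V_T (n x d): the edge weight between two words is a
    softmax-normalized similarity of their projections, and each word
    then aggregates the others with those weights."""
    P = V_T @ W                          # projected word features
    scores = P @ P.T                     # pairwise edge significance
    np.fill_diagonal(scores, -np.inf)    # no self-edge in the aggregation
    scores = scores - scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=1, keepdims=True) # row-wise softmax over neighbours
    return V_T + A @ V_T                 # residual keeps each word's feature

rng = np.random.default_rng(3)
V_T = rng.normal(size=(6, 16))           # 6 words, feature size 16
W = rng.normal(size=(16, 16))
V_T2 = text_graph_update(V_T, W)         # repeat twice in the paper's setting
```

Stacking two calls of `text_graph_update` mirrors the two update rounds chosen in the experimental settings.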

E. Cross-Modal Matching
There is usually a coreference relationship between the objects in the RS image and the entities in the query text. As shown in Fig. 1, the two planes in the RS image correspond to "grey plane" and "blue plane" in the query text, respectively. To grasp the coreference relationship and improve the retrieval accuracy, we conceive a cross-modal matching method, which is divided into two modules, i.e., dynamic multiscale feature fusing and image-induced multimodal fusing, as shown in Figs. 4 and 5, respectively.
1) Dynamic Multiscale Feature Fusing: Given the updated high-level features V̂_I^high, low-level features V̂_I^low, and global feature v_g, the intention of this module is to dynamically fuse the above three features to solve the multiscale problem of RS images. Specifically, for the updated high-level features, a convolutional neural network (CNN) with a built-in 1 × 1 kernel Conv_{1×1}(·) is adopted for preliminary encoding and is activated by ReLU(·). Then, average pooling Avg(·) is used to reduce the feature dimension, and finally, the convolutional neural network is adopted again for deep encoding, as follows:

v̂_I^high = Conv_{1×1}(Avg(ReLU(Conv_{1×1}(V̂_I^high))))

where v̂_I^high represents the condensed feature containing large-scale object information.
The processing of the updated low-level features is consistent with that of the high-level features. The only difference is that a 3 × 3 convolution kernel is utilized instead of the 1 × 1 kernel in the CNN to ensure the consistency of feature dimensions and facilitate subsequent fusion, that is

v̂_I^low = Conv_{3×3}(Avg(ReLU(Conv_{3×3}(V̂_I^low))))

where v̂_I^low represents the fine-grained feature covering the information of small-scale objects.
Finally, v̂_I^high, v̂_I^low, and v_g are multiplied element-wise to obtain the final representation v̂_I of the RS image, i.e., v̂_I = v̂_I^high ⊙ v̂_I^low ⊙ v_g.
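Treating a feature map as an (n positions × d channels) matrix, a 1 × 1 convolution is just a per-position linear map, so the fusion branch can be sketched with NumPy as below. This is a simplified sketch: `condense` is a hypothetical helper, the pooling collapses all positions at once, and the 3 × 3 variant is not distinguished here.

```python
import numpy as np

def condense(V, W1, W2):
    """Sketch of one branch of the multiscale fusing module:
    Conv -> ReLU -> average pooling over positions -> Conv,
    with 1x1 convolutions written as per-position linear maps."""
    x = np.maximum(V @ W1, 0.0)   # Conv_{1x1} + ReLU
    x = x.mean(axis=0)            # Avg: pool over spatial positions
    return x @ W2                 # second Conv for deep encoding

rng = np.random.default_rng(4)
d = 32
V_high = rng.normal(size=(64, d))  # 64 positions (e.g. 8x8), d channels
V_low = rng.normal(size=(64, d))
v_g = rng.normal(size=d)           # global ResNet feature
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v_high = condense(V_high, W1, W2)
v_low = condense(V_low, W1, W2)
v_I = v_high * v_low * v_g         # element-wise fusion of the three scales
```

The element-wise product at the end matches the fusion described above: all three vectors must share the same dimension, which the pooling branches guarantee.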
2) Image-Induced Multimodal Fusing: To establish the feature association between the RS image and query text, we design an image-induced multimodal fusing module that guides the RS image feature, which integrates the high-level, low-level, and global features, to locate the relevant or significant features in the query text. First, the updated RS image feature v̂_I and text features V̂_T = {v̂_{t_i}}_{i=1}^n are projected by affine transformations. Afterward, bilinear similarity is exploited to measure the correlation between them. Finally, the features that match the RS image in the query text are weighted and summed to obtain a new feature.

This yields

v̂_T = Σ_{i=1}^n a_i v̂_{t_i}    (18)

where a_i is the normalized bilinear similarity between v̂_I and v̂_{t_i}, and v̂_T represents the query text feature condensed according to the correlation intensity with the RS image.
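A minimal sketch of the image-guided pooling, under the assumption that the affine projections are linear maps and the bilinear similarity is scored through a learnable matrix B (both illustrative; the exact parameterization is not given above):

```python
import numpy as np

def image_guided_pooling(v_I, V_T, W_i, W_t, B):
    """Score each word against the fused image feature with a
    bilinear form, softmax the scores, and take the weighted sum
    (sketch of Eq. (18))."""
    q = W_i @ v_I                 # affine-projected image feature
    K = V_T @ W_t.T               # affine-projected word features
    scores = K @ (B @ q)          # bilinear similarity per word
    scores = scores - scores.max()
    a = np.exp(scores) / np.exp(scores).sum()   # normalized weights a_i
    return a @ V_T                # condensed query-text feature

rng = np.random.default_rng(5)
d = 16
v_I = rng.normal(size=d)                    # fused RS image feature
V_T = rng.normal(size=(7, d))               # 7 word features
W_i, W_t, B = (rng.normal(size=(d, d)) for _ in range(3))
v_T = image_guided_pooling(v_I, V_T, W_i, W_t, B)
```

The resulting v_T lives in the same space as the word features, so it can be compared directly with v̂_I when computing the cross-modal similarity S.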

F. Triplet Loss
We choose the triplet loss as the loss function, following [4], to increase the distance between a sample and its corresponding negative samples and to make the distance between the sample and its positive samples as close as possible:

L = Σ_{(I,T)} ([α − S(I, T) + S(I, T̂)]_+ + [α − S(I, T) + S(Î, T)]_+)

where α represents the margin, [x]_+ = max(x, 0), S(I, T) represents the similarity of the RS image and text, and T̂ and Î denote the negative text and image samples, respectively.
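A batch-wise sketch of the bidirectional triplet loss, assuming matched image-text pairs sit on the diagonal of the batch similarity matrix and every off-diagonal entry acts as a negative (the sum-over-negatives variant; a hardest-negative variant is equally common and the paper does not pin this down here):

```python
import numpy as np

def triplet_loss(S, alpha=0.2):
    """Bidirectional hinge-based triplet loss over a batch similarity
    matrix S (images x texts), with matched pairs on the diagonal."""
    n = S.shape[0]
    pos = np.diag(S)                                    # S(I, T) per pair
    cost_t = np.maximum(alpha - pos[:, None] + S, 0.0)  # negative texts
    cost_i = np.maximum(alpha - pos[None, :] + S, 0.0)  # negative images
    mask = 1.0 - np.eye(n)                              # drop positive pairs
    return ((cost_t + cost_i) * mask).sum() / n

# A well-separated batch (diagonal 1, off-diagonal -0.5) incurs zero loss.
S_good = np.eye(3) - 0.5 * (1 - np.eye(3))
```

When every positive similarity exceeds every negative one by more than the margin α, both hinge terms vanish and the loss is zero, which is exactly the training target described above.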

V. EXPERIMENTS

A. Settings
Datasets: In this article, we select two public datasets (please refer to Fig. 6), i.e., the RSICD and RSITMD datasets, to verify the model's effectiveness. The RSICD dataset [31] is a large-scale and diverse RS image caption dataset containing 10 921 images and 30 scenes, and it has become a preferred benchmark for RS image caption tasks. The RSITMD dataset [4] is a fine-grained dataset dedicated to RS cross-modal text-image retrieval. Some images in this dataset are selected from the RSICD dataset, while others are from Google Earth, totaling 4743 images, 23 715 captions, and 24 scenes.
Settings: All the experiments are performed with PyTorch [41], running on a Tesla V100 GPU with 32-GB memory. For the RS image, the image embedding dimension is 512. The word embedding dimension is set to 256, and the hidden layer of the BiGRU is set to 512. In this manner, the dimensions of the RS image and query text features can be kept consistent for the subsequent feature interactions. In terms of hypergraph construction, for the high-level and low-level RS image hypergraph networks, the number of vertices connected by a hyperedge is fixed at 6. In pursuit of a balance between complexity and efficiency, the update times of the textual fully connected graph network and the multiscale RS image hypergraph networks are set to 2. Adam is selected as the optimizer to train the network for up to 50 epochs with the batch size set to 128. During training, the learning rate is set to 1e-4 and decayed by a factor of 0.7 every 5 epochs. For evaluation indicators, R@K (K = 1, 5, and 10) and mR are applied to evaluate the performance of the proposed model. R@K represents the percentage of queries whose ground truth appears in the top-K results. Moreover, to reasonably evaluate the model's overall performance, we also use the average of the six recall rates to obtain mR.
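The R@K and mR metrics can be computed from a query-item similarity matrix as follows. This sketch assumes one ground-truth item per query, indexed identically to the query (the datasets actually pair each image with several captions, which the real evaluation accounts for).

```python
import numpy as np

def recall_at_k(S, k):
    """S: (n_queries, n_items) similarity matrix with the ground-truth
    item of query i at column i. Returns the percentage of queries
    whose ground truth appears in the top-k retrieved items."""
    ranks = (-S).argsort(axis=1)          # items sorted by similarity
    hits = [i in ranks[i, :k] for i in range(S.shape[0])]
    return 100.0 * np.mean(hits)

def mean_recall(S_t2i, S_i2t):
    """mR: average of the six recall rates (R@1/5/10, both directions)."""
    rs = [recall_at_k(S, k) for S in (S_t2i, S_i2t) for k in (1, 5, 10)]
    return float(np.mean(rs))

# With an identity similarity matrix, every ground truth ranks first.
S = np.eye(12)
```

A perfect retriever therefore scores 100.0 on every R@K and on mR, which gives a quick sanity check for an evaluation pipeline.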

B. Baselines
We select several previous state-of-the-art models, which are specially oriented to image-text matching, as the comparison baselines to verify the effectiveness of our model, as follows.
VSE++ [42]: VSE++ embeds image and text information into the same space by using a convolutional network and a recurrent network and utilizes a triplet loss to train the image-text matching model.

SCAN [19]: Building on the foundation of VSE++, SCAN applies Faster R-CNN [50] to extract image features and attempts to align the corresponding objects in the image and query text.
CAMP [43]: CAMP introduces an adaptive message passing mechanism to control the flow of information transmission between different modes adaptively and uses the fused features to calculate the matching degree of image and text.
MTFN [36]: MTFN leverages rank decomposition to construct a multimodal fusion network for calculating the distance of embedded features.

TABLE I
EXPERIMENTS OF SENTENCE-TO-IMAGE RETRIEVAL AND IMAGE-TO-SENTENCE RETRIEVAL ON RSICD TEST SET

TABLE II
EXPERIMENTS OF SENTENCE-TO-IMAGE RETRIEVAL AND IMAGE-TO-SENTENCE RETRIEVAL ON RSITMD TEST SET

Table I summarizes the experimental results of HyperMatch on the RSICD dataset. It can be seen from Table I that the proposed HyperMatch achieves significantly improved performance compared with the state-of-the-art models in both sentence-to-image and image-to-sentence retrieval. In the mR metric, HyperMatch outperforms the best CAMP-Triplet model by 5.36%. When using a sentence as a query to retrieve RS images, HyperMatch improves by 2.02%, 7.15%, and 9.90% in the R@1, R@5, and R@10 metrics, respectively. The experimental results demonstrate that, given a query sentence, HyperMatch can better match the RS images related to the sentence. At the same time, retrieving sentences with an RS image as the query improves performance by 1.93%, 5.14%, and 6.01%, respectively, which verifies the effectiveness of image-to-sentence retrieval. In addition, the experimental results on the RSITMD dataset (please refer to Table II) also show that HyperMatch achieves competitive performance in most indicators, e.g., it exceeds the best-performing MTFN model by 0.74% in the mR indicator.

C. Comparisons
HyperMatch achieves superior performance on the selected datasets mainly for the following reasons. On the one hand, aiming at the multiple types, uneven distribution, and multiple scales of objects in an RS image, the high-level and low-level RS image hypergraph networks are well-designed to model the relationships between objects at different scales and cluster the features of similar objects into the same hyperedge. On the other hand, an undirected fully connected graph network is conceived to quantify the mutual contribution of words in the query text. Furthermore, the constructed cross-modal matching module learns the coreference relationships between the objects in the RS image and the entities in the query text.

D. Ablations
To explore the importance of each pivotal component of the proposed model, ablation experiments are performed on the two selected datasets in this section. The results are summarized in Tables III and IV. From the experimental results in Table III, it can be observed that when the high-level RS image hypergraph network is eliminated from the model, the performance in the sentence-to-image retrieval task decreases by 0.6%, 1.42%, and 1.56% on the R@1, R@5, and R@10 indicators, respectively. In the image-to-sentence retrieval task, the performance also declines by 0.59%, 0.81%, and 1.5%, respectively. The main reason for the performance degradation is that the objects in RS images possess multiple types and scales. Once the high-level RS image hypergraph network is removed, the relationships between large-scale objects are no longer modeled, nor can large-scale objects of a similar type be clustered into the same hyperedges through dynamic hypergraph learning to realize the information interaction between objects. In addition, an interesting phenomenon is that the performance degradation after removing the module is not particularly obvious, mainly because the low-level RS image hypergraph network, which models the relationships between small-scale objects, has not been eliminated and compensates for the degradation caused by removing the high-level RS image hypergraph network. Analogously, when the low-level RS image hypergraph network is eliminated, the performance on the mR metric decreases by 1.09%, which verifies the module's ability to model the relationships between small-scale objects and aggregate information between objects based on those relationships. Note that the improvement brought by low-level hypergraphs is not as significant as that brought by high-level hypergraphs, especially on image-to-text retrieval.
We attribute this to the fact that high-level hypergraphs absorb high-level semantic information; compared with the more implicit and localized low-level semantic information, relational modeling of high-level information plays a more important role in recognizing global image information, which is also easier for text features to mine.
The textual fully connected graph network regards words as vertices and the contribution of words to each other as edges. By utilizing the self-attention mechanism on the fully connected graph network, each word can aggregate information according to the importance of other words. Therefore, when the module is removed, the performance decreases significantly, e.g., on the sentence-to-image retrieval task, in terms of R@1, R@5, R@10 indicators, the performance decreased by 1.56%, 2.2%, and 2.57%, respectively.
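The aggregation this module performs can be illustrated with plain scaled dot-product self-attention over word features, where the attention matrix plays the role of the fully connected graph's edge weights. A simplified sketch (not the authors' exact formulation, which may include learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_graph_attention(word_feats):
    """Words are vertices of a fully connected graph; pairwise attention
    scores act as edge weights, and each word re-aggregates all words
    weighted by their importance to it."""
    d = word_feats.shape[-1]
    scores = word_feats @ word_feats.T / np.sqrt(d)  # mutual contributions
    attn = softmax(scores, axis=-1)                  # row-normalized edge weights
    return attn @ word_feats                         # weighted aggregation
```

Each output row is a convex combination of all word features, so a word that contributes strongly to another dominates the latter's updated representation.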
The dynamic multiscale feature fusing module is intended to combine the large-scale and small-scale object features learned from the high-level and low-level RS image hypergraph networks. In addition, the global feature of the RS image is dynamically fused to deal with the multiscale nature of objects. Thus, after removing this module, the performance decreases significantly, e.g., by 1.54% in the mR metric.
To establish feature association between the remote sensing image and the query text, a dynamic image-induced multimodal fusing module is designed to guide the positioning of the most relevant or significant features in the query text by integrating the high-level, low-level, and global features of the RS image. After the module is removed, the performance degrades noticeably, e.g., in sentence-to-image and image-to-sentence retrieval, R@1, R@5, and R@10 decrease by (1.2%, 2.84%, 3.3%) and (1.16%, 1.62%, 2.99%), respectively, which illustrates the importance of this module for cross-modal remote sensing image retrieval. Table IV shows the ablation experiment results on the RSITMD dataset, from which it can be observed that eliminating each of the five aforementioned vital components decreases the mR metric by 1.27%, 0.74%, 2.75%, 2.09%, and 3.3%, respectively. The phenomenon and analysis of the performance degradation are the same as those in Table III and will not be repeated here for simplicity. Note that the decline is most significant when removing the textual fully connected graph network and the dynamic image-induced multimodal fusing module, illustrating the necessity of self-attention-based weighted aggregation of entity information in text and the effectiveness of learning feature associations between the RS image and the query text.

E. Case Study
To intuitively demonstrate the proposed model's performance in text-to-image and image-to-text retrieval, we select several examples (as shown in Fig. 7) for analysis of the two tasks.
Case 1 in Fig. 7(a) shows the five retrieved RS images that are most relevant to the content of the query text, that is, "Three white planes were parked in the gray parking lot." From the retrieved results, we can see that the proposed HyperMatch accurately retrieves the most suitable RS image (i.e., Rank 1), consistent with the ground truth, according to the query text, which shows the superior ability of the model in text-image retrieval. In addition, the remaining retrieved RS images are also highly related to the content of the query text. In particular, the objects in the RS images are in accordance with the keywords of the query text (e.g., "planes" and "parking lot"), which verifies the rationality of the retrieval results. For the retrieval results of Case 2, the ground truth is ranked second. Even so, the content of Rank 1 is highly similar to that of the ground truth, e.g., there are four water tanks in both RS images. The remaining three retrieved RS images also contain the key entity in the query text, that is, the "tank," which further verifies the ability of the model to retrieve RS images by query text. Fig. 7(b) illustrates the relevant captions retrieved according to an RS image. It can be found from Case 1 that all three ground truths are included in the five top-ranked results retrieved through an RS image, and the keyword "pool" in the retrieved non-ground-truths, i.e., Rank 2 and Rank 4, is also in keeping with the object in the RS image. The retrieval situations in Case 2 are similar to Case 1. On the one hand, all three ground truths are retrieved. On the other hand, the remaining two captions that are not within the ground truths are also related to the content of the RS image, such as "boat" and "bridge" in Rank 3 and "boats" and "water" in Rank 4. The two cases demonstrate the competitive performance of the proposed model on image-to-text retrieval.

F. Visualization
To visually show whether the proposed model can accurately locate critical components (such as object positions) in RS images according to query text, we verify this capability on the semantic localization task, which refers to locating the regions that best match the query text in a large scene. Following the work proposed in [4], we first use sliding windows of various sizes to cut the large scene image so as to maintain the multiscale characteristics of the objects. Afterward, the similarity between each patch obtained after segmentation and the query text is calculated to form a probability map. After that, the obtained probability distributions are combined, and a median filter is utilized to remove the impulse noise in the probability map to ensure that the results are robust. Finally, the probability map is fused with the original RS image to generate a located image that can intuitively display the semantic positioning ability of the model. Fig. 8 illustrates the two selected examples.
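The pipeline just described, sliding-window segmentation, patch-text similarity accumulation, and median filtering, can be sketched roughly as follows; `score_fn` stands in for the cross-modal similarity model, and the window handling is a simplification of the multiscale scheme in [4]:

```python
import numpy as np

def sliding_patches(img, win, stride):
    """Cut a large scene into square windows of size `win` with step `stride`."""
    H, W = img.shape[:2]
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            yield (y, x), img[y:y + win, x:x + win]

def probability_map(img, win, stride, score_fn):
    """Accumulate patch-text similarities into a per-pixel probability map."""
    H, W = img.shape[:2]
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    for (y, x), patch in sliding_patches(img, win, stride):
        s = score_fn(patch)                  # placeholder for patch-text similarity
        acc[y:y + win, x:x + win] += s
        cnt[y:y + win, x:x + win] += 1
    return acc / np.maximum(cnt, 1)          # average over overlapping windows

def median_filter3(p):
    """3x3 median filter to suppress impulse noise in the probability map."""
    out = p.copy()
    for y in range(1, p.shape[0] - 1):
        for x in range(1, p.shape[1] - 1):
            out[y, x] = np.median(p[y - 1:y + 2, x - 1:x + 2])
    return out
```

In practice, maps from several window sizes would be combined before filtering, and the filtered map is overlaid on the original image to produce the located image.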
Example (a) in Fig. 8 aims to locate football grounds surrounded by cars and houses in a large scene RS image.

Fig. 8. Visualization of semantic localization results. Both (a) and (b) contain the query text, the segmented RS image, the probability map, and the located RS image. The segmented RS image is cut into multiple patches of different scales according to the method proposed in [4]. The probability map is the probability distribution heat map formed by combining the similarity between each patch of the segmented image and the query text. The located image is generated by fusing the probability map with the original RS image, making it convenient to visually discover the places in the RS image related to the query text.

From the located image, we can observe that several football grounds in the RS image are located. From the probability map, one can also find that the parts with high probability (colored orange and red) form a "circle" shape, which corresponds well to the keyword "surrounded" in the query text, demonstrating that the proposed model can not only locate the objects in the large scene RS image according to the query text but also understand the spatial relationships between the objects. In example (b), we attempt to locate the playgrounds and trees near the buildings. From the located image, it can be found that the two playgrounds are accurately located. Moreover, the two playgrounds are given the highest probability (the deepest color) in the probability map, which confirms the model's capability in the semantic localization task.

VI. CONCLUSION
Cross-modal RS image retrieval aims to retrieve RS images using other modalities, such as text, or to query other modalities via RS images. The multiscale and multicategory characteristics of objects in RS images make it difficult to match the short query text, further restricting the performance of RS image retrieval. The hyperedge in a hypergraph can connect an arbitrary number of vertices and has significant advantages in representing high-order complex relationships in data. In recent years, hypergraph learning has attracted extensive attention and developed rapidly. Therefore, this article introduces it into cross-modal RS image retrieval and proposes HyperMatch to realize accurate matching between the RS image and the query text by learning the spatial relationships between objects in the RS image, the contribution relationships between words in the query text, and the correspondence between the objects in the RS image and the entities in the query text.
Specifically, high-level and low-level RS image hypergraph networks are constructed to model the relationships between objects of different scales and cluster similar object features into the same hyperedge. For the construction of a hypergraph, cosine similarity is utilized as the metric to measure the correlation of features in the RS image. For the dynamic update of a hypergraph, vertex attention and hyperedge attention are designed to realize the dynamic alternating update of vertices and hyperedges. Experiments on the published RSICD and RSITMD datasets verify the effectiveness of HyperMatch in cross-modal RS image retrieval. In the future, we will explore the feasibility of applying hypergraph learning to other multimodal tasks, such as modeling high-order relationships within and between modalities.
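As a rough illustration of the hypergraph construction summarized above, one common recipe spawns one hyperedge per vertex and connects it to its most cosine-similar neighbors. The following sketch assumes this top-k variant, which may differ in detail from the authors' implementation:

```python
import numpy as np

def cosine_sim(V):
    """Pairwise cosine similarity between feature vectors (rows of V)."""
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return Vn @ Vn.T

def build_incidence(V, k=3):
    """Build a hypergraph incidence matrix H from RS image features:
    vertex i belongs to hyperedge e iff H[i, e] == 1. Each hyperedge
    gathers a vertex and its k-1 most cosine-similar neighbors, so
    features of similar objects fall into the same hyperedge."""
    S = cosine_sim(V)
    n = V.shape[0]
    H = np.zeros((n, n))                 # rows: vertices, cols: hyperedges
    for e in range(n):
        nbrs = np.argsort(-S[e])[:k]     # includes e itself (self-similarity = 1)
        H[nbrs, e] = 1.0
    return H
```

Dynamic hypergraph learning would then alternately update vertex and hyperedge features (via the vertex and hyperedge attention described above) and rebuild H from the refreshed features.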