From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data

Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems. Although VQA in computer vision has been widely researched, VQA for remote sensing data (RSVQA) is still in its infancy. Two characteristics need special consideration for the RSVQA task: 1) no object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representations; 2) each image in the RSVQA task comes with questions of clearly different difficulty levels. Directly training a model with questions in a random order may confuse the model and limit its performance. To address these two problems, this paper proposes a multi-level visual feature learning method that jointly extracts language-guided holistic and regional image features. In addition, a self-paced curriculum learning (SPCL)-based VQA model is developed to train networks with samples in an easy-to-hard way. More specifically, a language-guided SPCL method with a soft weighting strategy is explored in this work. The proposed model is evaluated on three public datasets, and extensive experimental results show that the proposed RSVQA framework can achieve promising performance.


I. INTRODUCTION
Images from spaceborne and airborne platforms usually cover large-scale geographical areas and provide important databases for many Earth observation (EO) applications [1]. With the development of EO technology, there have been an increasing number of works on remote sensing image analysis, such as land use classification [2], [3], object detection [4], [5], road extraction [6], and change detection [7], [8], [9]. However, due to the specialised nature of remote sensing tasks, the ability to carry out such tasks is limited to experts in the related fields. The semantic information obtained from some remote sensing tasks is not intuitive to common users, which makes it difficult to deliver the image information to users in domain-specific applications.
Fig. 1. Motivation of the proposed method: learning features from easy samples to hard ones.
Fortunately, novel tasks such as image captioning [10], [11] and visual question answering (VQA) [12], [13] have recently been explored for visual data. These tasks take both natural language and imagery as inputs and output easy-to-understand text in natural language. Among them, VQA has become a hot research topic in the artificial intelligence community [14]. Given an input image and a natural language question, VQA aims to generate a textual answer to the question based on the image content [15]. It is an interdisciplinary research area between computer vision and natural language processing [16]. It is also a challenging task that requires a model to jointly learn multi-modal representations from both imagery and language data. More specifically, a VQA model needs to learn a visual representation to understand the input image and effective features for natural language to produce an answer conditioned on the image content [17].
As for remote sensing data, VQA enables end-users to better understand a complicated remote sensing image and has great potential in human-computer interaction applications [18]. A pioneering work can be found in [19], where the authors created two datasets and proposed a baseline model of VQA for remote sensing data (RSVQA). Although VQA in computer vision has been widely studied [20]-[23], VQA for remote sensing imagery is still in its infancy. Due to different image characteristics, VQA methods for natural images may not work well on remotely sensed images. Specifically, two main challenges for the RSVQA task are summarized as follows.
• No object annotations available in RSVQA datasets. In computer vision, VQA models are able to make use of existing object annotations to learn features tailored to objects, which helps a lot to improve performance [24], [25]. In contrast, there are no object annotations available in RSVQA datasets, making it difficult for RSVQA models to take advantage of informative region information.
• Questions with clearly different difficulty levels. For each image in the RSVQA task, there are questions with clearly different difficulty levels, and models are usually trained with these questions in a random order. However, this may confuse the model and hence affect the final performance, as easy and difficult questions are in the same batch.
Aiming at the above-mentioned two challenges, our motivations are explained from two aspects. First, both holistic and region features should be well exploited to enhance visual representation for RSVQA. Though the holistic feature provides the global information of the input image, it may neglect some important details, whereas the region feature can provide more detailed semantic information, which is critical for answering complicated questions. Moreover, because remote sensing images usually contain objects of various scales, using region representation is also helpful for addressing the scale-variation problem. To harness both features, we propose a multi-level visual feature learning method. Specifically, the language-guided holistic image feature and the region feature are jointly learned to improve the performance of RSVQA models.
Second, the model should be trained in ascending order of learning difficulty. Humans tend to learn from easy to hard. Inspired by the human learning process, self-paced curriculum learning (SPCL) is explored in this work and shows promising results. It considers question attributes and model feedback to dynamically adjust the question sequence for model training in ascending order of difficulty [26], [27], namely, from easy samples to hard ones. Although SPCL has been successful in many problems, it remains underexplored for the RSVQA task.
To sum up, the main contributions of this work are as follows:
• A multi-level visual feature learning method is proposed to jointly exploit both holistic and region features. Specifically, a cross-modal global attention (CGA) module is devised to learn the language-guided holistic image feature, and a cross-modal spatial transformer (CST) module is developed to learn the question-related region feature.
• The proposed CST module applies affine transformation to visual features to automatically crop informative regions without object annotations. Moreover, the language feature is also used as guidance to generate multiple spatial transformation parameters for obtaining richer region features.
• A language-guided SPCL method with a soft weighting strategy is devised for RSVQA. It takes question length and type as prior knowledge and dynamically adjusts the question sequence to enable a more effective training process: learning with easy questions first and then with hard ones.
The rest of the paper is organized as follows. Related works about VQA for both natural images and remote sensing images are introduced in Section II. The methodology is described in Section III. Section IV presents experimental results and discussion. Finally, this paper is concluded in Section V.

II. RELATED WORK
Multi-modal feature learning [28]-[30] plays an important role in both remote sensing and computer vision tasks. For a typical VQA framework, learning multi-modal representation is also one of the core components. Mateusz et al. [31] combined semantic segmentation of scenes and symbolic reasoning over questions to learn multi-modal features. With the development of deep learning, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are usually employed for visual and language feature encoding and have become mainstream feature learning methods [17], [32], [33]. Stanislaw et al. [15] introduced the task of free-form and open-ended VQA and employed VGGNet and LSTM to extract multi-modal features.
Besides, visual attention mechanisms are also widely used in VQA tasks to make the model focus on important pixels. Chen et al. proposed a language-guided attention method that projects question embeddings into a visual space to learn multi-modal features [34]. Fukui et al. [35], Kim et al. [36], and Ben et al. [37] designed different multimodal bilinear pooling methods to integrate visual features with language features. Yu et al. [38] further reduced the co-attention method into question self-attention and question-conditioned attention for learning better visual features. However, softly-attended visual features are still holistic representations of the image. These methods neglect detailed region information, which is critical for alleviating the scale-variation problem in the RSVQA task.
To leverage the object-level semantic information of the input image, patch-based and object-based feature learning methods have been proposed to extract more representative local features. For VQA models in computer vision, object detectors are usually used to represent the image as a collection of bounding boxes [39]-[41]. Recently, vanilla grid convolutional feature maps [42] have also been proven to be effective for visual feature learning in VQA and image captioning tasks. However, these methods all rely on bounding box annotations, which are not available in RSVQA datasets.
Compared with VQA for natural imagery, the research for RSVQA is still in its early stage. The first work for VQA on remote sensing data was introduced by Sylvain et al. [19], where a template-based automatic method was designed to build two remote sensing-oriented datasets. The image-question-answer triplets of the two datasets were constructed via the information from OpenStreetMap and pre-defined templates. They employed a CNN to extract visual features and an RNN to learn language features. After the point-wise fusion of the multi-modal features, an answer is predicted by a classification task. This work paves the way for RSVQA by providing datasets and a baseline method for further research. However, due to the unique characteristics of remote sensing imagery, more specific feature learning algorithms need to be investigated and explored for this task.

III. METHODOLOGY
As mentioned above, this work focuses on two problems. The first is how to adaptively exploit both holistic and regional visual features for answering different types of questions. The second is how to train a model more effectively with questions of different difficulty levels. The whole architecture of the proposed RSVQA framework is shown in Fig. 2. It consists of two parts: 1) CGA and CST modules that learn multi-level visual features with multiple spatial contexts; 2) SPCL-based network training, which aims to train a model in an easy-to-hard way. Following prior works [19], [43], we formulate predicting the answer conditioned on the input image and question as a classification task instead of a sentence generation task. Specifically, the final answer is the class with the highest predicted probability. In the following sections, the two parts of our RSVQA framework will be described in detail.
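The classification formulation can be illustrated with a minimal sketch. The answer vocabulary and the classifier scores below are hypothetical values chosen for illustration; the only point is that the predicted answer is the candidate class with the highest softmax probability.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of classifier scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_answer(logits, answer_vocab):
    """Return the answer class with the highest probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return answer_vocab[best], probs[best]


# Hypothetical answer vocabulary and scores for one (image, question) pair.
answer_vocab = ["yes", "no", "rural", "urban"]
logits = [2.0, 0.5, -1.0, 0.1]
answer, confidence = predict_answer(logits, answer_vocab)
```

Treating answers as classes keeps the output space fixed and makes accuracy straightforward to measure, at the cost of not being able to produce answers outside the vocabulary.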

A. Multi-level Visual Feature Learning with Multiple Spatial Contexts
Compared with natural images, remote sensing data usually contain much richer content due to top-down views, which enables people to ask various types of questions for the same input image. Existing RSVQA datasets in [19] contain the following five types of questions, and an example is given for each type:
• Rural/Urban. "Is it a rural or an urban area?" The answer to this question can be deemed as a typical binary classification task.
• Presence. "Is a road present?" To answer presence-related questions, a VQA model needs to predict whether there exist specific objects.
• Comparison. "Are there more roads than residential buildings?" We need to compare the areas or objects involved.
• Area. "What is the area covered by residential areas?" The model needs to seek out the target according to the question.
• Count. "What is the amount of small buildings?" To answer count-related questions, the numbers of specific objects need to be predicted.
As can be seen, different image regions and spatial contexts need to be employed to answer various types of questions. Thus, we propose to combine holistic (global) and region (local) features to learn multi-level visual representation with multiple spatial contexts. Specifically, CGA is designed to extract global visual features with the guidance of language. CGA utilizes language features as the guidance to generate global attention maps on the whole image. Meanwhile, the CST module is proposed to extract region features in an adaptive way. CST learns to spatially transform feature maps to be of different scales and poses, and the transformed features can be used to enhance visual representations. Details of the two proposed modules are illustrated in Fig. 3 and are introduced in the following subsections.
1) Cross-modal Global Attention Module: Attention mechanisms have shown their effectiveness in many computer vision tasks [28], [44], [45]. This is mainly because focusing on important regions of an image can improve the discriminability of visual features. For object recognition or semantic segmentation tasks, only one modality, i.e., vision, is input to the model, and self-attention is used. However, there are two modalities in our case. Therefore, we propose a cross-modal global attention, i.e., CGA, which exploits the language feature as guidance to generate global attention maps on all locations.
Formally, let $x$ be the input image and $q$ the corresponding question. For visual and language modalities, CNNs and RNNs are commonly used as feature encoders. Correspondingly, the visual feature $F_x \in \mathbb{R}^{N \times C \times H \times W}$ and the language feature vector $v_q \in \mathbb{R}^{N \times L}$ can be obtained with the networks. $N$ is the batch size, $C$ is the number of channels, and $H$, $W$ represent the height and width of the input image. $L$ denotes the dimension of the language feature vector. The typical, single-modality attention mechanism first encodes the visual feature $F_x$ into three independent features: the query $Q$, key $K$, and value $V$. The key idea of the attention mechanism is to assign weights to the input value $V$ according to a similarity function. Usually, the attention weights are computed by the compatibility between the input query and the corresponding key. Generally, the attention [44] for a single-modality input can be calculated by
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V, \tag{1}$$
where $d_k$ is the number of channels of $F_x$.
However, for RSVQA, we make use of the information of the natural language question to assign weights for visual features. To this end, we devise a multi-modal compatibility function to compute the similarities between language and visual features at different locations. First, a 1×1 convolution layer and a fully connected (FC) layer are employed to transform the visual and language features into $F^{attn}_x \in \mathbb{R}^{N \times C \times H \times W}$ and $V^{attn}_q \in \mathbb{R}^{N \times C}$. Then, we expand $V^{attn}_q$ to the same dimension as $F^{attn}_x$. Afterwards, the query $Q^{attn}$ is computed as in Eq. (2). Finally, the cross-modal attention is defined as in Eq. (3), where the value $V$ is computed by $V = \mathrm{Conv}_{1 \times 1}(F_x)$, and $d_k$ is the channel dimension of $F^{attn}_x$.
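For reference, the standard single-modality scaled dot-product attention of [44] can be sketched in a few lines of plain Python. This is only the generic form, softmax(QK^T / sqrt(d_k)) V, applied to small hypothetical matrices; it is not the cross-modal compatibility function of CGA, whose query is derived from both language and visual features.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for Q (n x d_k), K (m x d_k), V (m x d_v)."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Compatibility between this query and every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


# Hypothetical toy inputs: one query attending over two key/value locations.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
```

Because the query matches the first key more closely, the first value vector receives the larger weight in the output.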
2) Cross-modal Spatial Transformer Module: Attention mechanisms can make a model focus on informative image features by using soft weights. However, attention is still a global feature learning method with a fixed spatial context. Compared with global features, multi-level features are more effective. Therefore, we propose a cross-modal spatial transformer, i.e., our CST module, to extract features with adaptive scales and spatial contexts. The spatial transformation in CST includes cropping, translation, and scaling, which can be learned in an end-to-end manner without object annotations. As opposed to the soft attention in the CGA module, the CST module can be viewed as a hard attention method for region feature learning.
The spatial transformer in CST is a differentiable transformation module, and it is conditioned on both visual and language features. Specifically, there are three sub-components in the spatial transformer following the work in [46]: localization network, parameterized grid sampling, and differentiable bilinear sampling.
In this work, the transformation parameter $T_\theta$ can be defined as
$$T_\theta = \begin{bmatrix} s_1 & 0 & t_x \\ 0 & s_2 & t_y \end{bmatrix}, \tag{4}$$
where $s_1$, $s_2$ and $t_x$, $t_y$ are the parameters controlling the scaling, translation, and cropping transformations.
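To make the role of these parameters concrete, the following sketch generates a sampling grid from four values. It assumes that $T_\theta$ has the restricted affine form with scaling factors $s_1$, $s_2$ and translations $t_x$, $t_y$ only (no rotation or shear), and that coordinates are normalized to $[-1, 1]$ as in [46]. The function name `affine_grid` and the example values are illustrative.

```python
def affine_grid(s1, s2, tx, ty, height, width):
    """Map every output location to a source coordinate in normalized [-1, 1] space.

    Each target coordinate (x_t, y_t) is sent to (s1 * x_t + tx, s2 * y_t + ty).
    """
    def linspace(n):
        # n evenly spaced target coordinates covering [-1, 1].
        return [0.0] if n == 1 else [-1.0 + 2.0 * k / (n - 1) for k in range(n)]

    xs, ys = linspace(width), linspace(height)
    return [[(s1 * x + tx, s2 * y + ty) for x in xs] for y in ys]


# Scaling by 0.5 and shifting by (0.5, -0.5) selects one quadrant of the input:
# source x ranges over [0, 1] and source y over [-1, 0].
grid = affine_grid(0.5, 0.5, 0.5, -0.5, height=3, width=3)
```

With scaling factors below 1, the grid covers only part of the input, which is how a crop can be expressed without any bounding box annotation.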
To predict the transformation parameter $T_\theta$, we design a cross-modal localization network. In CGA, $F^{attn}_x$ and $Q^{attn}$ are computed for generating the cross-modal attention. In the cross-modal localization network, we reuse $F^{attn}_x$ and $V^{attn}_q$ for predicting $T_\theta$. Since different feature channels focus on different parts of the image [30], [47], we propose to use multiple spatial transformers from different channel groups to extract richer visual features. The cross-modal feature used for this purpose is defined in Eq. (5). We split this cross-modal feature evenly along the channel dimension into $M^{attn}_1$ and $M^{attn}_2$. Then, two transformation parameters, $T_{\theta_1}$ and $T_{\theta_2}$, are predicted from the split features through FC layers, as in Eq. (6).
Note that the two spatial transformers share a similar differentiable bilinear sampling process, given in Eq. (7), where $i \in \{1, 2, ..., WH\}$ is the coordinate index and $c$ is the channel index. The transformed spatial coordinates $(x_i, y_i)$ and the feature coordinates $(u, v)$ of $F_x$ are normalized to the range $[-1, 1]$. $E^c_i$ denotes the features sampled using the transformation parameter $T_{\theta_1}$ or $T_{\theta_2}$. For each spatial transformer, we compute partial derivatives with respect to both features and coordinates. In this way, the whole network can be trained in an end-to-end manner.
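The sampling step can be sketched with the standard bilinear kernel used in spatial transformers, in which every feature location contributes with weight max(0, 1 - |dx|) · max(0, 1 - |dy|). The code below is an illustrative single-channel version; the mapping from normalized to pixel coordinates (corners of the map at -1 and 1) is an assumption, and the toy feature map is hypothetical.

```python
def bilinear_sample(feature, x, y):
    """Sample a 2D feature map (list of rows) at normalized (x, y) in [-1, 1].

    Locations outside the map receive zero weight and therefore contribute nothing.
    """
    h, w = len(feature), len(feature[0])
    # Assumed convention: -1 and 1 correspond to the first and last pixel centers.
    px = (x + 1.0) * (w - 1) / 2.0
    py = (y + 1.0) * (h - 1) / 2.0
    value = 0.0
    for v in range(h):
        for u in range(w):
            wx = max(0.0, 1.0 - abs(px - u))
            wy = max(0.0, 1.0 - abs(py - v))
            value += feature[v][u] * wx * wy
    return value


# Hypothetical 2 x 2 single-channel feature map.
feature = [[0.0, 1.0],
           [2.0, 3.0]]
center = bilinear_sample(feature, 0.0, 0.0)         # average of the four entries
top_left = bilinear_sample(feature, -1.0, -1.0)     # exactly the first entry
bottom_right = bilinear_sample(feature, 1.0, 1.0)   # exactly the last entry
```

The kernel is piecewise linear in both the feature values and the sampling coordinates, which is what allows gradients to flow back to the transformation parameters. The double loop is written for clarity rather than speed.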

B. Cross-modal Feature Learning: From Easy to Hard
Traditional stochastic training strategies usually take the input samples in a random order, which is contrary to the human learning process. Curriculum learning (CL) [48], self-paced learning (SPL) [49], and SPCL [26] have been proposed as more reasonable training algorithms for machine learning models. Their core idea is to train a model starting from easy samples and gradually including hard ones. Previous works [27], [50] show that designing proper ranking functions to organize training samples in ascending order of learning difficulty is helpful for improving the model performance. Since there exist questions with different difficulty levels for the same remote sensing image, training an RSVQA model from easy to hard is a more reasonable strategy.
In this subsection, we propose a language-guided SPCL training method for RSVQA. Generally, SPCL is composed of two parts: SPL and CL. SPL can be reformulated as an optimization problem, and the curriculum is dynamically adjusted according to model feedback during the training phase. The curriculum in CL is determined by prior knowledge, which is a prejudgment about the difficulty level of a specific task.
In this work, the target of SPL is to adjust the sequence of input samples during the training stage. Specifically, SPL utilizes adaptive weights for each training sample to control the training sequence by an importance sampling strategy. Let $v = [v_1, v_2, ..., v_N]$ denote the weight vector for the $N$ training questions. $x_i$ is the $i$-th input image, and $q_i$ is the $i$-th input question. $g(h(x_i), s(x_i), q_i; w)$ represents the whole RSVQA model. Here, $h(x_i)$ denotes the global feature learned by the CGA module, and $s(x_i)$ denotes the transformed feature learned by the CST module. $w$ represents the learnable network weights in the whole model. Then, SPL is exploited to train the model with samples organized in ascending order of learning difficulty. Based on the learned multi-level features, the SPL loss can be defined as
$$\min_{w,\, v \in [0,1]^N} \sum_{i=1}^{N} v_i L\big(y_i, g(h(x_i), s(x_i), q_i; w)\big) + f(v; \lambda), \tag{8}$$
where samples with larger $v_i$ have larger influence on model training and vice versa. $y_i$ is the ground truth label. Since we take answer prediction as a classification task, the loss function is a cross-entropy function represented by $L(y_i, g)$. $\lambda$ can be interpreted as the "age" of the model, which is used to control the learning pace.
Here, $f(v; \lambda)$ is a self-paced regularizer for controlling the learning process. The vector $v$ is learnable and updated by optimizing the SPL loss function. Basically, elements of $v$ can be hard (0 or 1) or soft (from 0 to 1). In what follows, we study the soft regularizer for SPL to enable a more flexible training of the RSVQA model. Specifically, the soft regularizer is defined in Eq. (9). Given that there are two disjoint blocks of variables, i.e., network weights $w$ and weight vector $v$, SPL is a biconvex optimization problem. Usually, the alternative convex search algorithm is used to solve it. When the network parameters $w$, including all learnable weights in the CGA and CST modules, are fixed, the global optimum $v^*$ can be computed by Eq. (10). After the vector $v$ is updated, we fix $v$ and optimize the network weights $w$ by a stochastic gradient descent (SGD) optimizer. Since easy samples can be quickly fitted with limited iterations, the loss values for easy samples are usually smaller than those for hard ones. So if the loss value $L$ is not larger than $\lambda$, the corresponding input question will be taken as an easy sample and trained with high priority. Otherwise, $v^*_i$ is set to 0, and the corresponding question will not be used for training.
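The weight update can be illustrated with one concrete choice of soft regularizer. The sketch below assumes the linear soft weighting scheme known from the self-paced learning literature, $f(v; \lambda) = \lambda(\frac{1}{2}\|v\|_2^2 - \sum_i v_i)$, whose closed-form optimum is $v_i^* = 1 - L_i/\lambda$ if $L_i < \lambda$ and 0 otherwise. This is an assumption made for illustration and may differ from the exact regularizer used in this work; the loss values are hypothetical.

```python
def soft_weights(losses, lam):
    """Closed-form sample weights for the (assumed) linear soft regularizer.

    Samples with loss below lam get a weight that shrinks linearly with the loss;
    samples with loss at or above lam are excluded (weight 0).
    """
    return [1.0 - loss / lam if loss < lam else 0.0 for loss in losses]


def weighted_loss(losses, weights):
    """Data term minimized over the network weights w while v is held fixed."""
    return sum(v * loss for v, loss in zip(weights, losses))


# Hypothetical per-question cross-entropy losses from the current model.
losses = [0.2, 1.0, 3.0]
v = soft_weights(losses, lam=2.0)      # easy, medium, and excluded sample
objective = weighted_loss(losses, v)   # value passed on to the SGD step
```

One round of the alternating scheme then consists of computing `v` with the network fixed, followed by SGD steps on the weighted loss with `v` fixed.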
As mentioned, $\lambda$ is the "age" of the model, which increases gradually along with the training iterations. In this work, we record the maximum and minimum loss values of epoch $t-1$ and use them to update $\lambda$ as follows:
$$\lambda = \big(\max(L_{t-1}) - \min(L_{t-1})\big) \cdot K + \min(L_{t-1}), \tag{11}$$
where $K$ is used to adjust the value of $\lambda$. Specifically, we define $K$ as a dynamically changing parameter for controlling the learning pace. The initial value of $K$ is set to 0.5, and it is updated during the training stage by Eq. (12). When the value of $\lambda$ increases, the model includes more difficult questions with larger loss values.
However, SPL does not incorporate prior knowledge in the learning process. In the initial stage, the network weights are still randomly initialized, and the loss values of easy and hard examples may not be accurate enough to determine the true difficulty order. Thus, incorporating prior knowledge is necessary in our case. Inspired by CL, we design a curriculum, namely, a ranking function to organize questions in an easy-to-hard order at the beginning of training. By combining SPL with CL, the proposed language-guided SPCL method for RSVQA can take advantage of both learning regimes.
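The update of the model "age" in Eq. (11) amounts to a linear interpolation between the smallest and largest loss of the previous epoch. The sketch below uses hypothetical loss values and treats $K$ as a given number, since only its initial value of 0.5 is stated here; the helper names are illustrative.

```python
def update_lambda(prev_epoch_losses, k):
    """Eq. (11): interpolate between the min and max loss of the previous epoch."""
    lo, hi = min(prev_epoch_losses), max(prev_epoch_losses)
    return (hi - lo) * k + lo


def num_selected(losses, lam):
    """Count the samples whose loss is not larger than the current threshold."""
    return sum(1 for loss in losses if loss <= lam)


# Hypothetical per-sample losses recorded during epoch t-1.
prev_losses = [0.1, 0.5, 2.1]
lam_initial = update_lambda(prev_losses, k=0.5)   # initial K = 0.5
lam_late = update_lambda(prev_losses, k=1.0)      # a larger K admits every sample
```

With K = 0.5 the threshold sits halfway between the extreme losses, so the hardest sample is held back; as K grows, the threshold approaches the maximum loss and all samples take part in training.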
Two factors are considered in this work to design the ranking function in CL: question length and question type. In most cases, longer questions are more complicated than shorter ones. In addition, different types of questions also have different difficulty levels. For example, object recognition is usually easier than the counting task.
Based on this prior knowledge, the SPCL loss can be defined as
$$\min_{w,\, v \in [0,1]^N} \sum_{i=1}^{N} v_i L\big(y_i, g(h(x_i), s(x_i), q_i; w)\big) + f(v; \lambda), \quad \text{s.t. } v \in \Psi, \tag{13}$$
where $\Psi = \{v \mid a^{T} v \le c\}$ is a pre-defined curriculum region used to initialize the weight vector $v$. $c$ is a constant, and $a$ is the ranking function that indicates the difficulty levels of training samples. Usually, $\Psi$ can be derived from a task-specific ranking function $a$ and a constant $c$. In this work, $a$ is defined by calculating $a_i = W_{q_i} Q_{q_i}$ for the $i$-th question. $W_{q_i}$ denotes the pre-defined prior weight for different question types. $Q_{q_i}$ is the length of the question, normalized by dividing by the maximum question length. Note that CL is only used in the initial stage, and SPL adaptively updates $v$ in the rest of the training stage.
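The ranking function can be illustrated with the example questions from Section III-A and the prior weights reported for the LR dataset in Section IV-C. Two details are assumptions of this sketch: question length is measured as the number of whitespace-separated words, and the constant c = 2.0 is an arbitrary illustrative value.

```python
# Prior weights per question type, as reported for the LR dataset.
PRIOR_WEIGHTS = {"rural/urban": 1.0, "presence": 1.0, "comparison": 3.0, "count": 4.0}


def difficulty_scores(questions):
    """a_i = W_{q_i} * Q_{q_i}: type prior times length normalized by the maximum length."""
    lengths = [len(text.split()) for text, _ in questions]  # assumed: word count
    max_len = max(lengths)
    return [PRIOR_WEIGHTS[qtype] * n / max_len
            for (_, qtype), n in zip(questions, lengths)]


def in_curriculum_region(v, a, c):
    """Check whether the weight vector v satisfies a^T v <= c."""
    return sum(ai * vi for ai, vi in zip(a, v)) <= c


questions = [
    ("Is it a rural or an urban area?", "rural/urban"),
    ("Is a road present?", "presence"),
    ("Are there more roads than residential buildings?", "comparison"),
    ("What is the amount of small buildings?", "count"),
]
a = difficulty_scores(questions)
easy_to_hard = [questions[i][1] for i in sorted(range(len(a)), key=a.__getitem__)]
# Selecting only the two easiest questions stays inside the region for c = 2.0.
starts_inside = in_curriculum_region([1.0, 1.0, 0.0, 0.0], a, c=2.0)
```

A small c forces the initial weight vector to concentrate on low-score questions, which matches the intended easy-first start of training.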

IV. EXPERIMENTS

A. Datasets
In order to evaluate the proposed RSVQA framework, we conduct experiments on three public datasets. Two of these are released by [19]: the low resolution (LR) and high resolution (HR) RSVQA datasets. The LR dataset is based on Sentinel-2 images at 10 m resolution. It contains 772 images of size 256×256 and 77,232 question-answer pairs. Among them, there are 23,002 (29.78%) pairs for the count question type, 22,882 (29.63%) for presence, 30,576 (39.59%) for comparison, and 772 (1.00%) for rural/urban. Overall, 77.8%, 11.1%, and 11.1% of the original tiles are assigned to the training set, validation set, and test set, respectively. The HR dataset is collected from high resolution orthoimagery data at 15 cm resolution. It consists of 10,659 images of size 512×512 and a total of 1,066,316 question-answer pairs. Specifically, there are 277,702 (26.04%) pairs for count, 278,335 (26.10%) for presence, 353,772 (33.18%) for comparison, and 156,507 (14.68%) for area. In general, 61.5%, 11.2%, 20.5%, and 6.8% of the original tiles are split into the training set, validation set, test set 1, and test set 2, respectively.
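As a quick consistency check, the per-type shares quoted above can be recomputed from the pair counts. The helper name `type_shares` is illustrative; the counts are the ones listed for the LR and HR datasets.

```python
def type_shares(counts):
    """Return the total number of pairs and each type's share in percent (2 decimals)."""
    total = sum(counts.values())
    return total, {name: round(100.0 * n / total, 2) for name, n in counts.items()}


lr_counts = {"count": 23002, "presence": 22882, "comparison": 30576, "rural/urban": 772}
hr_counts = {"count": 277702, "presence": 278335, "comparison": 353772, "area": 156507}

lr_total, lr_shares = type_shares(lr_counts)
hr_total, hr_shares = type_shares(hr_counts)
```

The rural/urban type accounts for only about 1% of the LR pairs, so its accuracy is estimated from far fewer questions than the other types.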
For the third dataset, RSIVQA, the image-question-answer triplets can be divided into three types according to the answer type: yes/no, number, and others. Following the experimental setting described in [43], we randomly sample 80%, 10%, and 10% of all triplets as the training set, validation set, and test set, respectively.

B. Implementation Details
We use the Adam optimizer with an initial learning rate of 1e-5 for model training. The batch size is set to 280 for the LR dataset and 70 for the HR dataset. For methods without multi-level feature learning, we train the models for 150 epochs on the LR dataset and 35 epochs on the HR dataset. Since models with the multi-level feature learning module need more epochs to converge, 300 epochs on the LR dataset and 70 epochs on the HR dataset are used for these models.
We take the method in [19] as the baseline model and conduct experiments to demonstrate the effectiveness of the proposed method. Note that the same language feature embedding network is used in both the baseline and the proposed framework. As for the visual feature learning module, we employ the proposed CST and CGA modules. Additionally, the traditional cross entropy loss is replaced by the proposed SPCL loss function to enable the easy-to-hard learning strategy. To comprehensively evaluate the proposed modules, we use accuracy with respect to question type, average accuracy, and overall accuracy as evaluation metrics. Each model is trained 3 times in every experiment (except for the results in Table VI), and the mean and standard deviation are reported on both datasets.
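The three metrics can be computed as in the following sketch. It assumes that average accuracy is the unweighted mean of the per-type accuracies and that overall accuracy is the fraction of correct answers over all questions; the record format and the toy predictions are hypothetical.

```python
from collections import defaultdict


def evaluate(records):
    """Compute per-type, average, and overall accuracy.

    records: iterable of (question_type, predicted_answer, ground_truth) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, truth in records:
        total[qtype] += 1
        correct[qtype] += int(predicted == truth)
    per_type = {t: correct[t] / total[t] for t in total}
    average = sum(per_type.values()) / len(per_type)       # unweighted over types
    overall = sum(correct.values()) / sum(total.values())  # weighted by question count
    return per_type, average, overall


# Hypothetical predictions: 3 of 4 presence and 1 of 2 count questions are correct.
records = [
    ("presence", "yes", "yes"), ("presence", "no", "no"),
    ("presence", "yes", "no"), ("presence", "no", "no"),
    ("count", "3", "3"), ("count", "5", "2"),
]
per_type, average, overall = evaluate(records)
```

The two summary numbers differ whenever question types are unevenly represented: average accuracy gives each type equal weight, whereas overall accuracy is dominated by the most frequent types.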

C. Comparisons on the LR Dataset
The proposed framework consists of two main components, namely multi-level visual feature learning (MLL) and SPCL. The LR dataset contains four types of questions: rural/urban, presence, comparison, and count. Taking into account the difficulty level of each question type, we set the prior weights $W_{q_i}$ as {rural/urban: 1.0, presence: 1.0, comparison: 3.0, count: 4.0}. From Table I, we can see that the proposed method achieves better results. Average accuracy improves by 3.27%, and overall accuracy improves by 4.01%. We also find that better performance is obtained for all question types. In particular, for the comparison type, there is an improvement of about 6% in accuracy compared with the baseline method. These results indicate that multi-level visual feature learning and SPCL training strategies can enhance the performance of RSVQA.
The proportions of training samples for SPCL on the LR dataset are visualized in Fig. 4. Note that this model does not include the MLL feature learning module. In this figure, the proportions of training samples for the four question types are compared in detail. During the first 15 epochs, CL is used to initialize the weight vector $v$. Thus, the proportions of the count and comparison question types are clearly smaller than those of rural/urban and presence. After the first 15 epochs, SPL is used to update the vector $v$ to control the training order of different question types. The proportion of the count type is the smallest in the first few epochs. Then, hard examples are gradually included. The results also support our assumption on the difficulty levels of different question types.
The precisions of different question types for SPCL during the training stage on the LR dataset are visualized in Fig. 5. From this figure, it can be clearly observed that easy questions achieve higher precisions during the first 15 epochs. Afterwards, the precisions of more difficult question types improve rapidly. Moreover, we can see that the precisions of easy question types are not affected by adding more difficult training samples.
The global attention maps and spatially-transformed maps are visualized in Fig. 6. The second column of the figure shows that the global attention mechanism learns to focus on important pixels of remote sensing images. The third and fourth columns indicate that the spatial transformers extract visual features of local regions. Note that we apply the learned transformation parameters to the original remote sensing images instead of feature maps for a clear visualization.

D. Comparisons on the HR Dataset
Since there are two test sets provided by [19], we report results on both of them. The experimental results on the HR dataset are displayed in Table II and Table III.
Table II shows numerical results on test set 1 of the HR dataset. There are four types of questions in the HR dataset: presence, comparison, count, and area. Their corresponding prior weights $W_{q_i}$ are set as {presence: 1.0, comparison: 3.0, count: 4.0, area: 4.0}. The results in Table II reveal that SPCL can consistently enhance the performance of RSVQA for all question types. In particular, the comparison between the baseline method and SPCL indicates that the VQA performance can be improved by simply replacing the cross entropy loss with our SPCL loss. This demonstrates the effectiveness of the proposed training strategy. The performance comparison between SPCL and SPCL+MLL shows that the multi-level visual feature is useful for this task. Comparing the baseline method with SPCL+MLL, we also find that improvements for easier question types, i.e., comparison and presence, are larger than those for harder ones. The experimental results reported in Table III demonstrate that the proposed method can outperform the baseline method on all question types on test set 2 of the HR dataset.

E. Comparisons on the RSIVQA Dataset
For the RSIVQA dataset, we take MAIN proposed in [43] as the baseline method. MAIN consists of two modules: a representation module and a fusion module. Image features and question representations are first learned by the representation module. Then, mutual attention and bilinear fusion are utilized to fuse image and question representations in an adaptive manner. The comparison between MAIN and our approach is presented in Table IV. On this dataset, according to the difficulty levels of question types, the prior weights $W_{q_i}$ are set as {yes/no: 1.0, others: 2.0, number: 3.0}. Compared with MAIN, our proposed method achieves much better performance on the yes/no and others types. Although the accuracy of our model on the number type is lower than that of MAIN, the proposed method obtains better performance in general on the RSIVQA dataset. This demonstrates that the proposed training strategy and multi-level feature learning modules are effective for the RSVQA task.

F. Ablation Study and Discussion
To evaluate the proposed framework more comprehensively, the following two ablation studies are conducted to explore the effect of different sub-modules. Specifically, to show the superiority of multi-level visual features, the CGA and CST modules are compared separately. Quantitative results are presented in Table V. The model with CGA outperforms the baseline method on three question types, the exception being the rural/urban type, which indicates that the cross-modal global attention mechanism can be used to enhance the distinguishability of visual features. Since the cross-modal spatial transformer enables more flexible visual feature extraction, the model with CST achieves better results (except for the rural/urban type) than both the baseline method and the model with CGA. Finally, by combining both modules, the MLL model obtains better or competitive performance among these competing methods. This demonstrates the superiority of using multi-level visual features in the RSVQA task.
To further study the effect of different numbers of training samples on model performance, we have conducted experiments by training the proposed SPCL+MLL model with different proportions of the training data.Specifically, 10%, 40%, 70%, and 100% training samples are used for model training.The results in Table VI show that the performance of the model becomes better as the number of training samples increases gradually.In addition, we can also see that the proposed method works well with different sizes of training data.
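The data-proportion protocol above can be sketched as follows. The text does not state how the subsets are drawn, so uniform random sampling with a fixed seed is assumed here, and `subsample` is a hypothetical helper name.

```python
import random


def subsample(samples, proportion, seed=0):
    """Draw a random subset containing `proportion` of the training samples.

    Assumption: uniform sampling without replacement and a fixed seed, so
    that each proportion yields a reproducible subset.
    """
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    k = max(1, round(proportion * len(samples)))
    rng = random.Random(seed)
    return rng.sample(samples, k)


# Placeholder sample indices; a real run would use (image, question, answer) triplets.
train_set = list(range(100))
for p in (0.1, 0.4, 0.7, 1.0):
    subset = subsample(train_set, p)
    print(p, len(subset))  # subset sizes: 10, 40, 70, 100
```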

V. CONCLUSION
In this paper, two challenges for the RSVQA task are addressed. First, there are no object annotations available in RSVQA datasets, although such annotations can provide rich semantic information for answering questions. To address this challenge, a multi-level visual feature learning method is proposed to jointly learn the language-guided holistic feature and region feature. Specifically, CGA and CST modules are devised to extract flexible visual features with different spatial contexts. Second, questions in RSVQA datasets have clearly different difficulty levels for the same remote sensing image. Directly training a model with questions in a random order may confuse the model and limit the performance. To alleviate this problem, we propose a language-guided SPCL method with a soft weighting strategy to train networks with samples in an easy-to-hard way. Extensive experiments are conducted on three public RSVQA datasets, and experimental results show that the proposed method can achieve state-of-the-art performance.
Fig. 2. Main architecture of the proposed VQA method. 1) First, multi-modal features are extracted from the two types of inputs, including visual features from the given image and language features from the question; 2) Then, the visual features and language features are fused to obtain the multi-modal representation; 3) Finally, the answer is predicted via a classifier.

Fig. 3. Illustration of the proposed CGA and CST visual feature learning modules. Using the proposed two modules, cross-modal global and spatially-transformed visual features can be learned jointly.

Fig. 5. Visualization of the precision of different question types for SPCL on the LR dataset. The precision of each question type during the first 50 training epochs is shown as a line of a different color.

Fig. 6. Illustration of some qualitative examples on the LR (upper two rows) and HR (lower two rows) datasets. The first column is the input image. The corresponding questions and predicted answers are presented below each image. Global attention maps are displayed in the second column. The last two columns show the spatial transformation. (Best viewed in color.)

TABLE I
COMPARISONS ON THE LR DATASET. BOTH THE MEAN VALUE AND THE STANDARD DEVIATION ARE REPORTED.

TABLE II
COMPARISONS ON THE TEST SET 1 OF THE HR DATASET. BOTH THE MEAN VALUE AND THE STANDARD DEVIATION ARE REPORTED.

TABLE III
COMPARISONS ON THE TEST SET 2 OF THE HR DATASET. BOTH THE MEAN VALUE AND THE STANDARD DEVIATION ARE REPORTED.

TABLE IV
COMPARISONS ON THE RSIVQA DATASET. BOTH THE MEAN VALUE AND THE STANDARD DEVIATION ARE REPORTED.

Fig. 4. Visualization of the proportions of training samples for SPCL on the LR dataset. The proportions of different question types during the first 50 training epochs are shown as lines of different colors.

TABLE V
ABLATION STUDY ON THE LR DATASET. BOTH THE MEAN VALUE AND THE STANDARD DEVIATION ARE REPORTED.

TABLE VI
PERFORMANCE COMPARISONS OF DIFFERENT NUMBERS OF TRAINING SAMPLES ON THE LR DATASET.