Change Detection Meets Visual Question Answering

The Earth’s surface is continually changing, and identifying changes plays an important role in urban planning and sustainability. Although change detection techniques have been developed for many years, they are still largely limited to experts and facilitators in related fields. In order to provide every user with flexible access to change information and help them better understand land-cover changes, we introduce a novel task: change detection-based visual question answering (CDVQA) on multitemporal aerial images. In particular, multitemporal images can be queried to obtain high-level change-based information according to content changes between two input images. We first build a CDVQA dataset containing multitemporal image–question–answer triplets using an automatic question–answer generation method. Then, a baseline CDVQA framework is devised; it contains four parts: multitemporal feature encoding, multitemporal fusion, multimodal fusion, and answer prediction. In addition, we introduce a change enhancing module into multitemporal feature encoding, aiming at incorporating more change-related information. Finally, the effects of different backbones and multitemporal fusion strategies on the performance of the CDVQA task are studied. The experimental results provide useful insights for developing better CDVQA models, which are important for future research on this task. The dataset will be available at https://github.com/YZHJessica/CDVQA.


I. INTRODUCTION
THE Earth's surface is continually changed by man-made and natural influences. These changes are closely involved in human and social development and also guide urban planning and sustainability [1]. Change detection aims at detecting differences of the same region at different times. Nowadays, change detection technology has developed significantly, and various algorithms achieve strong performance on remote sensing data [13]-[15]. Change detection methods can be divided into two main categories, depending on whether or not the types of changes are identified. One is binary change detection, which only detects changed regions but ignores the type of changes, e.g., the object-oriented key point vector distance for detecting binary land-cover changes [16] and the end-to-end 2D CNN for hyperspectral image change detection [17]. Change maps obtained by such methods are visualized with binary values to depict change information at the pixel level. The other is semantic change detection, for instance, using an asymmetric Siamese network to identify changes via feature pairs [18] or reasoning about bi-temporal semantic correlations [19]. These methods not only detect changed regions but also identify change types.
Although change detection has great application value, the specialized nature of this task makes change information limited to researchers. It is still difficult for end users to access and understand much of this important change information. For instance, ordinary users may be interested in a certain change type in a certain region, but it is inconvenient and ineffective for them to find it on change maps in practical applications. Considering this problem, efficient and effective change information interaction with end users becomes important. In this context, natural language processing (NLP) enables computers to understand text in almost the same way as humans. It is user-friendly and can greatly improve the interactivity between image analysis systems and end users. Therefore, in order to alleviate the interaction issue, the integration of computer vision and NLP [20] has gradually become a hot research topic in the machine learning community. In particular, tasks like visual description generation [21], visual storytelling [22], visual question answering (VQA) [23], [24], and visual dialog [25] have been successfully conducted in computer vision. Similarly, tasks integrating remote sensing imagery and NLP, such as image captioning and VQA, have also become an active research topic in the field of remote sensing [26], [27]. Captioning for remote sensing images was first proposed in [28], and Lu et al. [29] further explored captioning methods using both handcrafted and convolutional features and proposed a new dataset. Recently, a multilayer aggregated Transformer was utilized to extract information for caption generation [30]. Regarding VQA for remote sensing data (RSVQA), Lobry et al. [31] first introduced this task, built two datasets, and used a hybrid CNN-RNN model to extract features, and Yuan et al. [32] proposed a self-paced curriculum learning-based model trained gradually from easy to hard questions.
Compared to natural images, aerial images are more specialized due to the top-view perspective and complicated background. As shown in Fig. 1, answers to questions about natural images [23] are in many cases more obvious than answers to questions about aerial images [31] in VQA tasks. Besides, Fig. 1 illustrates that answers to questions about the comparison of multi-temporal aerial images require careful observation and even calculation, which is unfriendly to ordinary users. Though VQA for natural images has been studied for many years and VQA for remote sensing data has also gradually become a research focus, VQA for change detection based on multi-temporal images is under-explored. Considering the significance of the change detection task and its value in practical applications, it is vital to investigate how to improve the friendliness and accessibility of change detection systems for end users. Hence, there is a pressing need to develop end-user-accessible VQA algorithms for multi-temporal remotely sensed data.
In this paper, we introduce the task of change detection-based visual question answering (CDVQA) on multi-temporal aerial images. Specifically, given two aerial images captured at different times and a natural language question about them, the CDVQA task aims to provide an answer in natural language by comparing the content of the two images. To this end, we create a CDVQA dataset by an automatic generation method, which contains 2,968 pairs of multi-temporal images and more than 122,000 question-answer pairs. The questions are carefully designed to cover various types of changes. Moreover, we propose a baseline method for the CDVQA task as shown in Fig. 2. To sum up, the main contributions of this work are summarized as follows:
• We design an automatic question-answer generation method and create a new CDVQA dataset. Specifically, the proposed dataset contains 2,968 pairs of aerial images and more than 122,000 corresponding question-answer pairs.
• A baseline framework for the CDVQA task is proposed, including four parts: multi-temporal feature encoding, multi-temporal fusion, multi-modal fusion, and answer prediction. In addition, a change enhancing module is proposed to incorporate more change-related information into visual features.
• Extensive experiments are conducted to study the effects of different network parts on CDVQA performance. In particular, different backbones and multi-temporal fusion strategies are investigated. The results provide useful insights on the CDVQA task.
The rest of the paper is organized as follows. Section II introduces the construction of the CDVQA dataset. Section III presents the methodology. Experimental results and discussion are given in Section IV. Finally, Section V concludes this paper.

II. DATASET
Different from the traditional VQA task, CDVQA involves multi-temporal aerial images and requires time series analysis. Taking this into account, we choose the existing semantic change detection dataset SECOND [18] as the basic data to automatically generate a CDVQA dataset. The SECOND dataset collects bi-temporal high-resolution optical (RGB) images from several different aerial platforms and sensors, with spatial resolution varying from 0.5 m to 3 m [19]. The covered geographical positions include several cities in China, such as Shanghai, Hangzhou, and Chengdu. It has 4,662 pairs of aerial images of size 512 × 512 pixels, of which 2,968 pairs are publicly available. Each pair consists of a pre-event aerial image and a post-event image of the same location at different times. Besides, each pair has two pixel-level labeled semantic change maps, before and after the change. In each semantic change map, the non-change region and 6 land-cover classes related to changes are annotated: non-vegetated ground (NVG) surface, buildings, playgrounds, water, low vegetation, and trees. The authors of the SECOND dataset state in their paper that the semantic change maps are labeled by a team of experts in Earth vision applications and that high accuracy is guaranteed. Therefore, the generated question-answer pairs in this work are highly relevant to the content of the image pairs. Overall, this dataset provides critical pixel-level semantic change information for the main land-cover classes, which is sufficient for generating question-answer pairs for the CDVQA task. We thus use the 2,968 openly available pairs as our basic data for further dataset construction.

A. Multi-temporal Image-Question-Answer Triplets Construction
Formally, in each pair of multi-temporal aerial images, let x_t1 ∈ R^{3×H×W} be the image at time T_1 and x_t2 ∈ R^{3×H×W} be the image at time T_2. s_t1 ∈ R^{H×W} and s_t2 ∈ R^{H×W} denote the semantic change maps of x_t1 and x_t2, respectively, and each pixel in s_t1 and s_t2 indicates one semantic class, ranging from 0 to 6. Semantic change maps show changed regions and provide their change types at the pixel level. Background pixels denote non-change regions, which are the same in both s_t1 and s_t2 for an image pair. Foreground pixels indicate changed regions of different land-cover types. Specifically, the question-answer pairs are generated for the following question types.
• Change or not.
Change or not for an image pair. The most fundamental yet important information in change detection is whether a certain land cover has changed. Note that a change occurs regardless of whether the area of a land cover increases or decreases. For each pair of aerial images, the sets of changed land-cover classes L_t1 and L_t2 are extracted from s_t1 and s_t2, respectively. Let l_i be a land-cover class; l_i ∈ L_t1 or l_i ∈ L_t2 indicates that the corresponding land-cover type has changed, in which case the answer should be yes. On the contrary, if l_i ∉ L_t1 and l_i ∉ L_t2, the corresponding land cover has not changed, and the answer should be no. All land-cover types are traversed to generate multiple question-answer pairs.
Change or not for a single image. For change detection tasks, sometimes one wants to focus not only on whether a certain land-cover class has changed, but also on whether changes have occurred in the pre-event or post-event image. Therefore, we extract semantic change information solely from the first or second image to generate relevant questions and answers. Please note that in this work, the first image in an image pair refers to the pre-event/pre-change image, and the second image to the post-event/post-change image. In particular, for the land-cover class l_i, if l_i ∈ L_t1, the corresponding land cover has changed in the pre-event image, and the answer should be yes. Similarly, if l_i ∈ L_t2, the area of l_i has changed in the post-event image, and the answer is also yes. Otherwise, i.e., if l_i ∉ L_t1 and l_i ∉ L_t2, the answer to the question about whether it has changed in a single image should be no.
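The generation logic above can be sketched as follows. This is an illustrative reconstruction operating directly on the semantic change maps; the function names and question wording are ours, not the authors' code:

```python
# Hedged sketch of "change or not" answer generation from a pair of
# semantic change maps (class 0 = non-change, classes 1..6 = land covers).
import numpy as np

CLASSES = {1: "NVG surface", 2: "buildings", 3: "playgrounds",
           4: "water", 5: "low vegetation", 6: "trees"}

def changed_classes(s):
    """Set of changed land-cover classes present in a semantic change map."""
    return set(np.unique(s)) - {0}

def change_or_not(s_t1, s_t2):
    """Yield a (question, answer) pair for every land-cover class."""
    l_t1, l_t2 = changed_classes(s_t1), changed_classes(s_t2)
    for c, name in CLASSES.items():
        answer = "yes" if (c in l_t1 or c in l_t2) else "no"
        yield f"Has {name} changed?", answer

s1 = np.array([[0, 2], [0, 0]])   # buildings changed in the pre-event map
s2 = np.array([[0, 0], [4, 0]])   # water changed in the post-event map
qa = dict(change_or_not(s1, s2))
# qa["Has buildings changed?"] == "yes"; qa["Has trees changed?"] == "no"
```

A single-image variant would simply test membership in l_t1 or l_t2 alone rather than their union.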
• Increase/decrease or not.
Change detection in real-world applications often requires more specific change information, for instance, whether the area of a land cover has increased or decreased. In this context, we denote the area of l_i in s_t1 as A^i_t1 and the area in s_t2 as A^i_t2. For increase-related question-answer pairs, if A^i_t2 − A^i_t1 > 0, the area of l_i increases, and the answer to this question should be yes. The generation process for decrease-related pairs is similar: if A^i_t2 − A^i_t1 < 0, the area of l_i decreases. Note that the area of l_i is defined as the number of pixels with label l_i in the whole image.
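Following the definition above, the area comparison reduces to a pixel count; a hedged sketch (names are illustrative):

```python
# Increase/decrease answer generation, where area is a plain pixel count.
import numpy as np

def area(s, c):
    """Area of class c: number of pixels labeled c in the whole map."""
    return int((s == c).sum())

def increase_or_not(s_t1, s_t2, c):
    """Answer to 'Has the area of class c increased?' (A^c_t2 - A^c_t1 > 0)."""
    return "yes" if area(s_t2, c) - area(s_t1, c) > 0 else "no"

s1 = np.array([[2, 2], [0, 0]])   # class 2 covers 2 pixels at T1
s2 = np.array([[2, 2], [2, 0]])   # class 2 covers 3 pixels at T2
# increase_or_not(s1, s2, 2) -> "yes"
```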
• Change to what.
This type of question involves more detailed information about changes, i.e., what the land cover at time T_1 mainly becomes at time T_2. Such questions require analyzing the same region in multi-temporal images to obtain the change of land-cover types in this region. Although one class may change to more than one class over time, it is more meaningful to focus on the major change. In particular, for a semantic class, we first find its pixel indices in s_t1. Then, these indices are used to select the corresponding pixels in s_t2. Finally, we count the number of selected pixels for each land-cover type and choose the type with the largest count as the major change. In this case, the answer to the question of what the regions of l_i at time T_1 mainly change to should be the major change type.
• Largest/smallest change.
Largest/smallest change for an image pair. Such questions focus on the largest or smallest changes in multi-temporal images. For each land-cover type, all changes in the two images should be taken into account. Therefore, the changed area for the land-cover class l_i is A^i_t1 + A^i_t2. By traversing all change types, the maximum and minimum changed regions can be obtained, and the corresponding land-cover classes are the answers to this type of question. In this dataset, the smallest change is the class with the smallest changed area, and the unchanged type is not considered.
Largest/smallest change for a single image. To extract more detailed information about changes, we also analyze the maximum and minimum changed regions for the pre-event and post-event images separately. The maximum and minimum changed regions at time T_1 can be obtained by arg max_{l_i}(A^i_t1) and arg min_{l_i}(A^i_t1), and the selected land cover l_i is the corresponding answer. For time T_2, the generation process is the same. This type of question requires a model not only to identify land-cover changes in bi-temporal images but also to understand which image (T_1 or T_2) is queried by users. In this context, the question "What is the smallest change in the first image?" is actually asking about the land cover of the smallest changed region in the image captured at the earlier date.
• Change ratio.
Change ratio for all land covers. The percentage of changed regions is also very important information in practical applications. The change ratio can be calculated by dividing the changed area by the total area of the whole map, and similarly for the non-change ratio. Since proportions are continuous numbers, they are not directly compatible with the classification task. Thus, we discretize ratios into bins. To be more specific, numerical answers are quantized into 11 categories: 0%, 0%-10%, 10%-20%, 20%-30%, 30%-40%, 40%-50%, 50%-60%, 60%-70%, 70%-80%, 80%-90%, and 90%-100%. Notice that in this context, A%-B% means (A, B]. In this way, we calculate the change percentage for each image pair and obtain answers to the change ratio-related questions.
Change ratio for each land cover. In addition to the ratio of all changed regions, we also analyze the change ratio for each land-cover class in the pre-event or post-event image. Numerical answers are quantized as above. For each land-cover class l_i, we first calculate its changed areas A^i_t1 and A^i_t2 at T_1 and T_2. Then, the change ratio of l_i in the pre-event image is calculated by dividing A^i_t1 by the total area of the whole image. In the same way, change ratios for different land covers in the post-event image can be obtained.
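The (A, B] quantization described above can be sketched as follows; the helper name is ours:

```python
# Quantize a change ratio in [0, 1] into the 11 answer classes.
# Bins follow the paper's convention: 0%, then (0,10%], (10%,20%], ..., (90%,100%].
import math

def ratio_bin(ratio):
    if ratio == 0:
        return "0%"
    upper = math.ceil(ratio * 10)          # (A, B] binning: B is included
    return f"{(upper - 1) * 10}%-{upper * 10}%"

# ratio_bin(0.0) -> "0%"; ratio_bin(0.10) -> "0%-10%" (10% falls in (0, 10%])
# ratio_bin(0.55) -> "50%-60%"; ratio_bin(1.0) -> "90%-100%"
```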
In practice, we define multiple synonymous templates for each question type. During the question-answer generation process, question-answer pairs are generated separately for each question type of each image pair. As more than one template is designed per question type, we randomly select one of them to generate a sample. To balance the number of samples per question type, we set different probabilities for generating samples of different question types: a low probability for the "yes/no" type and a higher probability for the other question types. For each image pair, we generate 16 samples on average.

B. Question and Answer Distributions
As 2,968 pairs of images are publicly available, we use these images as the basic data to generate the CDVQA dataset. The whole dataset is split into a training set, a validation set, and test sets. To better evaluate the robustness and reliability of CDVQA models, we generate two test sets with different answer distributions. The class distributions of answers in the generated CDVQA dataset are displayed in Fig. 3. From this figure, we can see that the training set, validation set, and test set 1 share the same class distribution, while the answer distributions of test set 1 and test set 2 differ.
As we can see from Fig. 3, answer types in all subsets obey a long-tail distribution. Concretely, the answer class no dominates the answer distributions in all subsets. For example, in the training set, samples with answer no make up 30.9% of all instances, and in test set 1, the answer no accounts for 31.15% of all answers. In contrast, the answer 50%-60% occupies only 0.22% of all answers. The reason for this class imbalance is that more questions ask for yes or no: the answers to questions such as change or not and increase/decrease or not are always yes or no.
The question type distributions of all four subsets are presented in Fig. 4. For simplicity, change ratio for each land cover is denoted as class change ratio. We can see that the distributions of question types are also long-tailed. In addition, the question type change or not has the highest frequency in all subsets, which is why the two most frequent answer types are yes and no. Similar to the answer distributions, the question type distributions of the training set, validation set, and test set 1 are the same, while they differ from that of test set 2. Specifically, the proportions of questions about "change to what", "change ratio", and "class change ratio" increase in test set 2 compared to test set 1. Since questions of these types are more difficult, test set 2 is harder than test set 1. Visualization examples of the generated CDVQA dataset are shown in Fig. 5.

III. METHODOLOGY
In this work, the CDVQA task is treated as a classification task. Note that semantic change maps are only used to generate question-answer pairs in the dataset preparation phase; in CDVQA, only image pairs, questions, and the corresponding answers are used for training and evaluating a model. As shown in Fig. 2, our CDVQA model takes as input two aerial images and a question, and outputs an answer predicted by the network. The whole network architecture consists of four parts. The first component is a multi-temporal visual feature learning module, which encodes the input images into deep features. The second part, named multi-temporal fusion, is responsible for fusing the features of the two images. The third is a multi-modal fusion module that fuses the image and question features. The fourth is an answer prediction part, which takes the fused multi-modal feature as input to predict the answer. In addition, for the CDVQA task, we design a change enhancing module to encourage the model to focus on changed pixels of the input images. The proposed modules in our CDVQA framework are described in detail in the following subsections.

A. Multi-temporal Encoder
Different from tasks like image classification, object detection, and semantic segmentation, change analysis involves two input images of the same location taken at different times. Accordingly, a CDVQA system takes multi-temporal images as input. In order to identify changes between two images, temporal differences should be extracted and analyzed.
To handle multiple inputs, Siamese networks are commonly used in many vision tasks. We denote the feature of the image at time T_1 as F_1 = f_1(x_t1). Likewise, f_2(·) is used to obtain the encoded representation of the image at time T_2. For Siamese networks, we set the network architectures and parameters of f_1 and f_2 to be the same.
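A minimal sketch of such a weight-sharing Siamese encoder in PyTorch; the tiny convolutional trunk here is a stand-in for the ResNet/ViT backbones actually used:

```python
# Weight-sharing Siamese encoder: f1(.) and f2(.) are the same module,
# so both time steps are encoded with identical parameters.
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x_t1, x_t2):
        # One set of parameters encodes both inputs.
        return self.backbone(x_t1), self.backbone(x_t2)

enc = SiameseEncoder()
f1, f2 = enc(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
# f1 and f2 have identical shape: (2, 32, 16, 16)
```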
In this work, we explore the effects of different encoder networks on CDVQA. For visual feature extraction, convolutional neural networks (CNNs) are usually used to learn feature representations, and ResNet [33] is an important milestone in the development of CNN architectures. Thus, ResNets of different scales, e.g., ResNet-18, ResNet-101, and ResNet-152, are employed as the multi-temporal encoder of our CDVQA model, aiming to study the effects of CNN capacity on CDVQA.
Recently, the Transformer architecture [34] has achieved excellent performance on NLP tasks [35]. Designed for sequence modeling, the Transformer has the significant advantage of using attention to learn long-range dependencies in data. Considering its great success in the language modeling domain, it has also been applied to computer vision tasks such as image classification [36], [37], object detection [38], and semantic segmentation [39]. In this work, a Transformer-based encoder for multi-temporal images is also used.

B. Change Enhancing Module
Change detection is a fundamental task in remote sensing and also the core of the CDVQA task. To answer change-related questions, a model needs to focus on changed regions and further analyze semantic information. In a number of computer vision tasks, the self-attention mechanism [40]-[42] is used to boost performance by focusing on important parts of data samples. However, there are two input images in our case, where the standard self-attention mechanism is not directly applicable. Hence, in this work, we propose a change enhancing module to strengthen the CDVQA model's capability of detecting changes.
We denote the encoded deep features of the two input images as F_1 ∈ R^{N×C×H×W} and F_2 ∈ R^{N×C×H×W}, where N is the batch size, C is the number of channels, and H and W are the height and width of the feature maps. The conventional self-attention model [34] first transforms the input feature into three independent features, i.e., the query Q, key K, and value V. In contrast, the proposed change enhancing module treats the feature representations F_1 and F_2 as the query and key, respectively, and computes their similarity F_s ∈ R^{N×C×H×W} as

F_s = f_q(F_1) ⊗ f_k(F_2),

where f_q(·) and f_k(·) are 1×1 convolutions for feature transformation and ⊗ denotes element-wise multiplication. Next, a change enhancing map M_ce ∈ R^{N×H×W} is obtained by

M_ce = σ(f_c(F_s)),

where f_c(·) is a 1×1 convolution layer for predicting the change enhancing map and σ(·) is the ReLU activation function. The map M_ce encourages the model to focus on regions where the differences between F_1 and F_2 are large. To this end, we scale M_ce with a parameter θ and add an identity matrix I; θ is a learnable parameter initialized to 0 and optimized during training in an end-to-end manner. Then, we multiply the transformed M_ce with the two encoded features:

F_c1 = (I + θ·M_ce) ⊗ F_1,
F_c2 = (I + θ·M_ce) ⊗ F_2,

where F_c1 and F_c2 are the final encoded features of the two input images. F_c1 and F_c2 are then fused by the multi-temporal feature fusion module, introduced in the next subsection.
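A possible PyTorch reading of this module, under the assumption that the similarity and the enhancing map are computed element-wise; this is a sketch of one consistent interpretation, not the authors' implementation:

```python
# Sketch of a change enhancing module: 1x1 convs transform the two features,
# an element-wise similarity is reduced to a 1-channel map, and a learnable
# scale theta (initialized to 0) controls how strongly the map is applied.
import torch
import torch.nn as nn

class ChangeEnhancing(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f_q = nn.Conv2d(channels, channels, 1)  # query transform
        self.f_k = nn.Conv2d(channels, channels, 1)  # key transform
        self.f_c = nn.Conv2d(channels, 1, 1)         # map prediction
        self.theta = nn.Parameter(torch.zeros(1))    # learnable scale, init 0

    def forward(self, F1, F2):
        F_s = self.f_q(F1) * self.f_k(F2)            # element-wise similarity
        M_ce = torch.relu(self.f_c(F_s))             # (N, 1, H, W) map
        scale = 1.0 + self.theta * M_ce              # identity + scaled map
        return scale * F1, scale * F2

mod = ChangeEnhancing(16)
F1, F2 = torch.randn(2, 16, 8, 8), torch.randn(2, 16, 8, 8)
Fc1, Fc2 = mod(F1, F2)
# With theta initialized to 0, the module starts as an identity mapping.
```

With θ = 0 at initialization, the module leaves the features untouched, so training can gradually learn how much change enhancement to apply.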

C. Multi-temporal Fusion
After feature encoding and change enhancement, we fuse the features of time T_1 and time T_2 to obtain the final visual feature F_v. For fusing multiple feature maps, element-wise subtraction, multiplication, summation, and concatenation are commonly used. Given two feature maps F_c1 ∈ R^{N×C×H×W} and F_c2 ∈ R^{N×C×H×W}, these fusion methods can be formulated as

F_v1 = F_c1 ⊖ F_c2,
F_v2 = norm(F_c1) ⊖ norm(F_c2),
F_v3 = F_c1 ⊕ F_c2,
F_v4 = F_c1 ⊗ F_c2,
F_v5 = [F_c1; F_c2],

where ⊖ denotes element-wise subtraction; note that we normalize the two features before the subtraction used to compute F_v2. ⊕ and ⊗ denote element-wise summation and multiplication, respectively, and [·; ·] stands for concatenation along the channel dimension. To study the effects of different fusion strategies, we compare and analyze their performance in Section IV.
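The five fusion strategies can be sketched as below; the function name and interface are illustrative, and ℓ2 normalization along the channel dimension is an assumption for the normalized variant:

```python
# Multi-temporal fusion strategies: subtraction, normalized subtraction,
# summation, multiplication, and channel-wise concatenation.
import torch
import torch.nn.functional as F

def fuse(Fc1, Fc2, mode="concat"):
    if mode == "sub":
        return Fc1 - Fc2
    if mode == "nsub":                    # l2-normalize, then subtract
        return F.normalize(Fc1, dim=1) - F.normalize(Fc2, dim=1)
    if mode == "sum":
        return Fc1 + Fc2
    if mode == "mul":
        return Fc1 * Fc2
    if mode == "concat":                  # along the channel dimension
        return torch.cat([Fc1, Fc2], dim=1)
    raise ValueError(f"unknown fusion mode: {mode}")

a, b = torch.randn(2, 16, 8, 8), torch.randn(2, 16, 8, 8)
# fuse(a, b, "concat") has 32 channels; the other modes keep 16 channels
```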

D. Multi-modal Fusion
Since CDVQA involves both visual features and language representations, we need to fuse multi-modal features. After the multi-temporal feature fusion, the final visual representation F_v ∈ R^{N×C×H×W} is obtained. Meanwhile, a recurrent neural network (RNN) is used to encode the question into a feature vector V_q ∈ R^{N×L}. As the skip-thoughts model has been applied in many remote sensing image-based NLP tasks [31], [43], we choose the pre-trained skip-thoughts model [44] for the language feature extraction part. Specifically, skip-thought vectors are learned with an encoder-decoder architecture, where both the encoder and decoders are RNNs. The encoder transforms the input sentence into a vector, and two decoders decode the vector into the previous and the next sentence, respectively. In this work, we use the encoder of skip-thoughts to generate language embeddings. Before fusing the features of the two modalities, we first transform the visual feature F_v into a feature vector F_vt ∈ R^{N×L}. The two feature vectors then have the same size, and we can fuse F_vt and V_q together. As how to fuse them is not the main research focus of this work, we simply merge them into a multi-modal feature by concatenation:

F_m = [F_vt; V_q],

where F_m ∈ R^{N×2L} is the fused multi-modal representation. Finally, since answer prediction is modeled as a classification task in this work, the feature F_m is passed through a classifier, i.e., two fully connected layers, to predict the answer. The answer is given by selecting the answer class with the highest probability. The output dimension of the first layer is 256, and the final output dimension of the classifier is 19, as there are 19 answer types. Specifically, the possible answers are no, yes, 0%-10%, 0, NVG surface, buildings, low vegetation, 10%-20%, trees, 20%-30%, water, 80%-90%, 30%-40%, 90%-100%, 70%-80%, 40%-50%, 60%-70%, 50%-60%, and playgrounds (sorted by number of samples).
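A minimal sketch of the multi-modal fusion and answer prediction head; the hidden width 256 and the 19 answer classes follow the text above, while the feature dimension L used here is a small illustrative value (skip-thoughts vectors are typically much larger):

```python
# Concatenate visual and question vectors, then classify with two FC layers.
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    def __init__(self, L=64, num_answers=19):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * L, 256), nn.ReLU(), nn.Linear(256, num_answers))

    def forward(self, F_vt, V_q):
        F_m = torch.cat([F_vt, V_q], dim=1)   # (N, 2L) multi-modal feature
        logits = self.classifier(F_m)
        return logits.argmax(dim=1)           # index of the predicted answer

head = AnswerHead(L=64)
pred = head(torch.randn(2, 64), torch.randn(2, 64))
# pred has shape (2,), with each entry an answer index in [0, 19)
```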

IV. EXPERIMENTS

A. Datasets
The CDVQA dataset consists of 2,968 publicly available image pairs of size 512 × 512. Based on these image pairs, more than 122,000 question-answer pairs are generated in total. The training, validation, and test sets are split based on image pairs captured at different geographical positions. In particular, the training set contains 65,967 question-answer pairs, generated from 1,600 (53.91%) image pairs, and the validation set contains 16,441 question-answer pairs, produced from 400 (13.48%) image pairs.

B. Implementation Details
The generated dataset for CDVQA follows the same format as the RSVQA work [31]. Regarding training parameters, the Adam optimizer is used with an initial learning rate of 1e-4. For all ResNet-based models, the batch size is set to 70, and the input images are scaled to 256 × 256. Since the used ViT [37] model requires an input size of 384 × 384, we reduce the batch size to 32 considering the GPU memory limit. For all experiments, models are trained for 50 epochs. We use accuracy as the metric for each question type; average accuracy and overall accuracy are also reported.

C. Effects of Different Backbones
The backbone network of the visual encoder is an important component. Therefore, we compare four different backbones: three ResNets (ResNet-18, ResNet-101, and ResNet-152) and the vision Transformer model ViT. In all experiments, multi-temporal visual features are fused by feature concatenation for all backbone networks.
The results on the two test sets are displayed in Table I and Table II, respectively. From the results, we can see that compared to ResNet-18 and ResNet-101, ResNet-152 does not show a significant performance advantage. For example, on test set 1, ResNet-18 and ResNet-152 deliver very close average and overall accuracies. This indicates that merely increasing the capacity of the backbone network for visual learning yields only a limited gain. However, when we change the backbone architecture from ResNet to Transformer, the performance improves further. The reason is that the self-attention mechanism of Transformer networks is beneficial for learning more representative features. Note that the parameters of the backbone networks are fixed during the training stage. In Fig. 6 and Fig. 7, we also visualize the training and validation losses of models with different backbones. It can be seen that the ViT backbone has significantly lower losses than the ResNet-based networks. Note that we omit the first 5 epochs to better compare the final convergence state.
Overall, the choice among ResNet backbones has very little impact on the performance of our framework, which suggests that visual feature learning alone may not be the key to improving accuracy. Other parts of the model, such as the multi-temporal fusion and change analysis parts, may be more critical for improving performance on the CDVQA task.

D. Effects of Different Multi-temporal Fusion Strategies
In this subsection, we quantitatively compare five commonly used feature fusion operations, namely concatenation (Concat), summation (Sum), subtraction (Sub), normalized subtraction (NSub), and multiplication (Mul). The numerical results on the two test sets are presented in Table III and Table IV. The results show that concatenation performs best. The concatenation operation first concatenates the two inputs, and then several fully connected layers fuse them with learnable weights, which makes it a more flexible and general fusion strategy. For change analysis tasks, one might intuitively expect subtraction to be the best fusion method, as it can better highlight changed regions. However, the results show that the subtraction operation cannot outperform the others. Considering that directly subtracting two features may undermine their representability, we also normalize the two input features with ℓ2 normalization before the subtraction operation. Nevertheless, the normalized subtraction operation is still no better than concatenation and summation. This indicates that directly subtracting two inputs is not useful for CDVQA tasks, and a dedicated change analysis module should be designed.

E. Effect of Change Enhancing Module
It is critical to obtain semantic change information from multi-temporal images. However, no pixel-wise ground truth change labels are available in this task. To incorporate change information into the model, we propose a change enhancing module to highlight changed regions in the input images. To validate its effectiveness, we conduct an ablation study; the numerical results are displayed in Table V and Table VI, where the change enhancing module is abbreviated as CEM. The experimental results on both test sets indicate that the proposed change enhancing module is beneficial to the CDVQA task. In particular, the proposed module consistently improves both the average accuracy and the overall accuracy.

F. Cross-dataset Evaluation
To explore the generalization ability of the model, we construct another CDVQA dataset as an additional test set. Specifically, we collect 138 image pairs of size 256 × 256 from the HTCD [45] dataset (only binary change maps are available) and manually annotate semantic change maps. Then, 3,303 question-answer pairs are generated and used for the cross-dataset test setting. To show the effectiveness of the proposed method, we compare the performance of a model trained on the CDVQA dataset with that of a model with randomly initialized weights. Table VII shows the numerical results. We can see that the model trained on our CDVQA dataset can be transferred to unseen scenarios, but its performance is not satisfactory in this cross-dataset setting. This is mainly due to the domain gap between the test sets of CDVQA and this one. Much more research effort is needed in this direction.

G. Discussion
Regarding the numerical results, the average accuracy is generally lower than 60%, and the overall accuracy is lower than 70%. Some visualization examples of CDVQA results are presented in Fig. 8, including both correctly predicted examples and failures. From the experimental results, we can conclude that CDVQA is a complex and challenging task. To correctly answer different types of questions, a model first needs to learn multi-modal representations of the input images and questions; visual and language understanding is thus of great importance. Besides, CDVQA also requires the model to analyze semantic change information. That is, the model needs not only to locate changed areas but also to identify the land-cover classes of changed regions in order to answer some complex questions. Currently, the proposed baseline framework does not make use of semantic change labels, so its performance on questions related to land-cover classes is not satisfactory. The change ratio for each land-cover class has higher accuracy than the change ratio over all land-cover classes, mainly because the former has more training samples. We also visualize the normalized confusion matrix in Fig. 9. Note that the confusion matrix is normalized along the predicted-label axis.
To better understand what the model has learned for making decisions, we visualize attention maps of our model on some examples in Fig. 10. It can be seen that the model learns to focus on the relevant changed regions to predict answers. From the experimental results, we conclude that more research effort is needed to reach satisfactory performance on the challenging CDVQA task. Specifically, more effective change analysis-based visual learning methods should be investigated. We also see that Transformer-based models have great potential for multi-temporal and multi-modal feature learning in CDVQA tasks. Additionally, self-supervised or unsupervised change detection methods need to be studied; how to obtain semantic change information from multi-temporal data in an unsupervised manner is an important research direction for CDVQA.

V. CONCLUSION
To provide ordinary end users with flexible access to change information, we introduce a new task named CDVQA, with natural language as the output. This task takes multi-temporal aerial images and a natural-language question as inputs and predicts the corresponding answer. Specifically, we create a new dataset that contains 2,968 pairs of aerial images and more than 122,000 question-answer pairs. In addition, a baseline CDVQA model is devised, and different components of the model are evaluated on the generated dataset. The experimental results outline problems that need to be addressed for the CDVQA task. This work also provides useful insights for developing better CDVQA models, which are important for future research in this direction.

Fig. 1. Examples of questions for natural imagery, aerial imagery, and multi-temporal aerial images in VQA tasks.

Fig. 3. Visualization of answer distributions of different subsets. From left to right: training set, validation set, test set 1, and test set 2.

Fig. 8. Visualization examples of CDVQA results. Each row presents three different questions on the same input image pair. Correctly predicted results are shown in blue, and wrong answers in red.

Fig. 9. Normalized confusion matrix for our CDVQA dataset on test set 1 (ResNet-152 is used as the backbone).
The main architecture of the proposed CDVQA framework. It contains four main parts: multi-temporal feature encoding, multi-temporal fusion, multi-modal fusion, and answer prediction. The value of a pixel in s_t1 indicates its semantic class at time T1, and the value of a pixel in s_t2 indicates its semantic class at time T2.

TABLE I. NUMERICAL RESULTS OF USING DIFFERENT BACKBONE NETWORKS ON TEST SET 1 OF THE CDVQA DATASET.

TABLE II. NUMERICAL RESULTS OF USING DIFFERENT BACKBONE NETWORKS ON TEST SET 2 OF THE CDVQA DATASET.
Fig. 7. Visualization of validation losses. Four different backbone networks are compared.

TABLE III. NUMERICAL RESULTS OF USING DIFFERENT FUSION STRATEGIES ON TEST SET 1 OF THE CDVQA DATASET.

TABLE V. ABLATION STUDY ON TEST SET 1 OF THE CDVQA DATASET WITH THE RESNET-101 BACKBONE.

TABLE VI. ABLATION STUDY ON TEST SET 2 OF THE CDVQA DATASET WITH THE RESNET-101 BACKBONE.

TABLE VII. EXPERIMENTAL RESULTS IN THE CROSS-DATASET TEST SETTING.