SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions

Speech-based Visual Question Answering (SBVQA) is a challenging task that aims to answer spoken questions about images. The challenges of this task involve the variability of speakers, the different recording environments, as well as the various objects in the image and their locations. This paper presents SBVQA 2.0, a robust multimodal neural network architecture that integrates information from both the visual and the speech domains. SBVQA 2.0 is composed of four modules: speech encoder, image encoder, features fusor, and answer generator. The speech encoder extracts semantic information from spoken questions, and the image encoder extracts visual information from images. The outputs of the two modules are combined using the features fusor and then processed by the answer generator to predict the answer. Although SBVQA 2.0 was trained on a single-speaker dataset with a clean background, we show that our selected speech encoder is robust to noise and speaker-independent. Moreover, we demonstrate that SBVQA 2.0 can be further improved by finetuning in an end-to-end manner since it uses fully differentiable modules. We open-source our pretrained models, source code, and dataset for the research community.


I. INTRODUCTION
Question Answering (QA) is the field concerned with building systems that automatically answer questions posed by humans in natural language. Question answering systems play an important role in web search engines, where the engine must understand the user's question to fetch the best answer from the database [1]. The Visual Question Answering (VQA) field is an extension of its predecessor field of QA. In VQA, the system is provided with two kinds of inputs: an image and a textual question about information in the given image. A typical VQA system is composed of four major modules: image encoder, question encoder, features fusor, and answer generator, where each module is responsible for a specific task. The first component is the image encoder, which converts the image into one or more feature vectors that represent the important information in the image, such as the types of objects, their locations within the image, and the relationships between them. The second component is the question encoder, which encodes the text of the question into a feature vector or a feature matrix that captures the semantics of the question. The third component is the features fusor, which operates on the features generated by both the image encoder and the question encoder to form the final set of features that will be fed to the answer generator. The last component is the answer generator module, which uses the output of the features fusor and learns to predict the correct answer. A further extension of the VQA field is SBVQA. In SBVQA, the text-based question encoder is replaced with a speech-based question encoder, and the question is given as an audio signal instead of text. This comes with challenges related to the nature of speech recognition tasks, such as the variability of speakers, the background noise in the recording environments, and the different contents presented in the speech. A robust speech-based question encoder must overcome all of these challenges. This paper addresses these issues by selecting an appropriate speech encoder that can capture the semantics of the speech regardless of these variabilities.

II. RELATED WORK
The earliest known SBVQA system was a mobile application called VizWiz [2]. It was developed to assist blind and visually impaired people in answering daily visual questions. The main idea was to allow users to take pictures, record spoken questions about these pictures, and submit them to a server through a mobile application. On the server side, human workers analyzed the pictures and answered the questions accordingly. In 2015, the VQA 1.0 dataset [3] was released to address the task of free-form and open-ended questions in VQA. It contains more than 0.6M questions with 10 answers per question. These questions were developed based on information from ∼205K images that cover a wide range of real-life scenarios. A well-known problem of the VQA 1.0 dataset is the presence of language priors that allow a model to achieve a high accuracy score depending only on the question [4]. Specifically, models trained on this dataset tend to learn these language priors and answer questions based on them rather than learning to answer based on information from both the image and the question. Goyal et al. [4] developed the VQA 2.0 dataset to balance the questions in the VQA 1.0 dataset by collecting complementary images such that every question is paired with different images that have different answers, mitigating the effect of the language priors. In 2017, Zhang et al. [5] extended the idea of VQA to answer spoken questions instead of textual questions, a task they called Speech-based Visual Question Answering (SBVQA). The only difference between VQA and SBVQA is that the text-based question encoder is replaced with a speech-based question encoder. They introduced two types of speech encoders: an Automatic Speech Recognition (ASR)-based speech encoder and an End-to-End (E2E) speech encoder. The ASR-based speech encoder first transcribes speech into text using an ASR system and then feeds the transcription into a text-based encoder. In contrast, the E2E speech encoder can be jointly trained or finetuned with the SBVQA pipeline to directly predict the final answer. Figure 2 illustrates ASR-based and E2E speech encoders.
All of the questions in the VQA 1.0 and VQA 2.0 datasets are textual. To the best of our knowledge, Zhang et al. [5] created the first VQA dataset with spoken questions, based on the VQA 1.0 dataset [3]; it contains about 200 hours of synthetic speech data and 1 hour of real speech data, which they open-sourced. In this paper, we refer to this dataset as the SBVQA 1.0 dataset. Another dataset, called Fact-based Visual Spoken-Question Answering (FVSQA) [6], was released in 2021; it has ∼5 hours of synthetic spoken questions for each of three languages: English, Hindi, and Turkish. The authors of this dataset also developed a model with an E2E speech encoder consisting of an LSTM layer operating on MFCC features. Recently, Tang et al. introduced the Textless Vision-Language Transformer (TVLT) [7], a multimodal encoder-decoder transformer pretrained on 1.85M video clips. TVLT learned to align video frames with their corresponding audio spectrograms using two objective functions: Vision-Audio Matching (VAM) and Masked AutoEncoding (MAE) [8]. Moreover, TVLT was finetuned on a spoken version of the VQA 2.0 dataset to support the SBVQA task.
The E2E speech encoders proposed in SBVQA 1.0 [5], FVSQA [6], and TVLT [7] are simple encoders that are prone to issues related to the variability of speech, the variability of speakers, and the different recording backgrounds. Speech encoding is considered a complex task due to the aforementioned challenges [9]. Such a complex task requires a large speech encoder that is capable of abstracting information from the spoken question regardless of the speaker or the level of noise in the recording environment [10].
In this paper, we seek to develop a robust E2E SBVQA model that addresses the challenges of speech encoding which previous systems could not solve. First, we describe a selection process to identify the best speech encoder from a given list of pretrained speech encoders. Then, we explore multiple image encoders to find the encoder that helps the model achieve the highest accuracy. After selecting the best speech and image encoders, we answer the following research questions to validate our final model architecture:
• RQ1: Does the model utilize both visual and speech information?
• RQ2: Is the speech encoder speaker-independent?
• RQ3: Is the speech encoder noise-robust?
• RQ4: Is the E2E finetuning effective?
The contribution of this paper is twofold: 1) we built the SBVQA 2.0 dataset, which includes a multi-speaker subset, and 2) we built a fully differentiable E2E SBVQA model that is speaker-independent and noise-robust. We released our source code, dataset, and pretrained models for the research community.

III. DATA COLLECTION
The VQA 2.0 dataset [4] was built to balance the VQA 1.0 dataset [3]: each question is associated with multiple different images and answers, which forces the model to look at the image in order to answer the question. Zhang et al. [5] built the SBVQA 1.0 dataset based on the VQA 1.0 dataset [3], which means it inherits the issues of the VQA 1.0 dataset. In this work, we built the SBVQA 2.0 dataset based on the VQA 2.0 [4] and SBVQA 1.0 [5] datasets as follows: 1) we built a lookup table that maps each unique textual question in the VQA 1.0 dataset [3] to the corresponding spoken question from the SBVQA 1.0 dataset [5]; 2) we used the lookup table to find the spoken question for each textual question in the VQA 2.0 dataset [4]; 3) we collected all textual questions from the VQA 2.0 dataset [4] that did not have corresponding spoken questions in the lookup table and synthesized them using Amazon Polly. A sketch of this procedure is shown below. The original SBVQA 1.0 dataset was created using the voice of a single speaker called Joanna, which we also used to build the SBVQA 2.0 dataset.
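For concreteness, the following is a minimal sketch of this lookup-based construction in Python. The file paths and JSON field names ("question", "audio_path") are hypothetical placeholders, not the actual layout of the datasets.

import json

# Hypothetical inputs: question lists for VQA 1.0 (with SBVQA 1.0 audio
# paths attached) and VQA 2.0.
with open("vqa1_questions.json") as f:
    vqa1 = json.load(f)
with open("vqa2_questions.json") as f:
    vqa2 = json.load(f)

# Step 1: map each unique VQA 1.0 question text to its SBVQA 1.0 audio file.
lookup = {q["question"]: q["audio_path"] for q in vqa1}

# Steps 2 and 3: reuse existing audio where possible; queue the rest for TTS.
to_synthesize = []
for q in vqa2:
    audio = lookup.get(q["question"])
    if audio is not None:
        q["audio_path"] = audio          # step 2: reuse the SBVQA 1.0 recording
    else:
        to_synthesize.append(q)          # step 3: synthesize with Amazon Polly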
To test the effect of speaker variability on the speech encoder, we randomly sampled 3,750 unique questions from the validation set (val2014) of the SBVQA 2.0 dataset, representing a total of 39,132 questions, and converted them into spoken questions by 8 different speakers other than Joanna. These speakers were not used in the validation set during training. Instead, we used them to evaluate the model on unseen speakers and measure the effect of changing the speaker. We refer to this dataset as the ''multi-speaker dataset''. Table 1 shows the details of all speakers in the SBVQA 2.0 dataset.
To test the effect of background noise on the speech encoder, we injected noise into the multi-speaker dataset. We created 5 versions of the multi-speaker dataset using different levels of Signal-to-Noise Ratio (SNR) based on the following formula:

$$\mathrm{SNR_{dB}} = 10 \log_{10} \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}} \qquad (1)$$

where $P_{\mathrm{signal}}$ is the average power of the signal and $P_{\mathrm{noise}}$ is the average power of the noise. $\mathrm{SNR_{dB}} = 0$ means that the average power of the signal and the average power of the noise are equal. In general, the higher the $\mathrm{SNR_{dB}}$, the better the quality of the final audio file.
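As an illustration, the sketch below mixes a noise recording into a clean utterance at a target SNR following Eq. (1). It assumes the noise clip is at least as long as the signal; variable names are illustrative.

import numpy as np

def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(signal)]              # trim the noise to the signal length
    p_signal = np.mean(signal ** 2)           # average power of the signal
    p_noise = np.mean(noise ** 2)             # average power of the noise
    # Scale the noise so that 10 * log10(p_signal / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise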

IV. SBVQA 2.0 ARCHITECTURE SELECTION
Since each component contributes to the overall performance of the model, the selection of the appropriate candidate for each component is crucial. In the following sections, we describe how we selected each component, leading to the final SBVQA 2.0 architecture shown in Figure 1.

A. SPEECH ENCODER SELECTION
We selected 3 pretrained E2E speech recognition models that achieved state-of-the-art results in recent years: WavLM [12], UniSpeech [13], and Conformer-L [14]. Both WavLM and UniSpeech have two versions: a pretrained model and a finetuned model. For all of these models, we only used the encoder part to extract the embedding vectors of the spoken questions.
Each embedding vector represents a phoneme or part of a phoneme, which implies that each spoken question is represented by a matrix of embeddings. We hypothesized that the best model to be selected as the speech encoder should satisfy the following two conditions: 1) It should cluster the embedding vectors of the same spoken question from different speakers into well-shaped clusters. A cluster is considered well-shaped if all the embedding vectors that represent the same phoneme have very small distances between each other. This indicates that the model has learned to produce the same representation of a given utterance regardless of who spoke it.
2) The clusters should be as separate from each other as possible to minimize confusion between embedding vectors of different phonemes. To test these conditions, we sampled 100 unique textual questions from our multi-speaker dataset, where each textual question has 9 spoken versions by 9 different speakers. We extracted the embedding vectors for each spoken version of the same textual question, combined all of them into one matrix, and projected them onto a 2D space using the UMAP [15] algorithm. After that, we clustered the projections of the embedding vectors using the K-means algorithm. To find the optimal number of clusters (K), we computed the Silhouette [16] score for each value of K ranging from 2 to N/2, where N is the number of combined embedding vectors of all speakers for the same utterance. The range of Silhouette score values is from −1 to 1. A Silhouette score approaching 1 indicates that the clusters are dense and well-separated. In contrast, a Silhouette score approaching 0 implies that the clusters overlap, while a Silhouette score near −1 means data points have been assigned to the wrong clusters [16].
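The sketch below outlines this selection procedure for one question, assuming the frame embeddings of all 9 spoken versions have already been stacked into a single (N, D) matrix.

import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_silhouette(embeddings: np.ndarray) -> tuple[int, float]:
    """embeddings: (N, D) frame embeddings pooled over all 9 speakers."""
    proj = umap.UMAP(n_components=2).fit_transform(embeddings)  # 2D projection
    n = proj.shape[0]
    best_k, best_score = 2, -1.0
    for k in range(2, n // 2 + 1):            # K ranges from 2 to N/2
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(proj)
        score = silhouette_score(proj, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score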
We repeated the same steps to find the optimal K and its corresponding Silhouette score for all 100 unique questions using each speech encoder. Finally, we calculated the mean Silhouette score for each speech encoder by averaging the Silhouette scores associated with the optimal K over all 100 unique questions. Table 2 shows the mean Silhouette score of the 100 unique questions for each speech encoder with a 99% confidence interval. The results show that Conformer-L was the best candidate among all suggested speech encoders. One reason could be the training strategy and the training dataset: Conformer-L was trained on 24,500 hours of transcribed English speech [17], whereas the other two models were trained mainly using self-supervised techniques. Another reason could be attributed to the Conformer architecture itself. Conformer is composed of multiple layers, where each layer has a multi-head self-attention (MSA) layer followed by a convolutional (Conv) layer. MSA and Conv layers have been proven to behave as low-pass and high-pass filters, respectively [18]. Therefore, they enable the network to select the appropriate frequency band in each layer. This gives Conformer an advantage over the other two models, which only use Conv layers as feature extractors. Appendix A shows UMAP projections of the embeddings of a sample utterance extracted by all suggested speech encoders.

B. IMAGE ENCODER SELECTION 1) GENERAL EXPERIMENTS
To determine the best image encoder for our SBVQA 2.0 model, we selected 6 candidates: Faster R-CNN [19], CLIP [20], DETR [21], RegionCLIP [22], a hybrid DETR-CLIP encoder, and BLIP [23]. First, we used the Faster R-CNN image encoder to generate bottom-up features [24], which are known as the de facto standard vision features for vision-language (VL) tasks, and we used them to train our SBVQA 2.0 baseline model. The main issue with Faster R-CNN is that it is not fully differentiable: the Region Proposal Network (RPN) in Faster R-CNN uses Non-Maximum Suppression (NMS), which is a non-differentiable operation and thus hinders E2E training of the model [25], [26]. We trained another SBVQA 2.0 model using the original image encoder of CLIP [20], which extracts a single vector that represents the whole image. Moreover, we trained an SBVQA 2.0 model on grid features based on the last feature map of the backbone of CLIP's image encoder. We found that the grid features were not helpful, and the model scored the lowest accuracy compared to the other image encoders. After that, we conducted another experiment using the transformer of DETR as an image encoder by extracting the last hidden states of the decoder part and treating them as image features. The intuition was that the transformer module of DETR has learned to extract useful latent representations from the image due to DETR's objective function, which aims to locate and recognize objects in the image. Next, we created a hybrid image encoder by combining CLIP's ability to summarize the image into a single vector with DETR's object detection capabilities. The hybrid features were constructed by simply concatenating the feature vector generated by CLIP with each feature vector produced by DETR, as sketched below. Furthermore, we used RegionCLIP [22], which operates on multiple smaller regions of the image instead of the whole image at once, to extract image features, and we trained another SBVQA 2.0 model. This encoder is also not fully differentiable because of the NMS operation. Finally, we trained an SBVQA 2.0 model that used the classification (CLS) token of the vision transformer (ViT) [27] from the BLIP [23] model as the image feature vector.
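As an illustration of the hybrid DETR-CLIP features, the sketch below concatenates the single CLIP image vector onto every DETR decoder hidden state; the tensor shapes are illustrative.

import torch

def hybrid_features(clip_vec: torch.Tensor, detr_states: torch.Tensor) -> torch.Tensor:
    """clip_vec: (B, D_clip) global image vector; detr_states: (B, Q, D_detr)
    last hidden states of the DETR decoder (one per object query)."""
    expanded = clip_vec.unsqueeze(1).expand(-1, detr_states.size(1), -1)
    return torch.cat([detr_states, expanded], dim=-1)  # (B, Q, D_detr + D_clip)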
Our experiments showed that the image encoder of the BLIP [23] model was the best candidate: it scored the highest accuracy among all fully differentiable image encoders. Table 3 shows the results of all SBVQA 2.0 models that we trained based on the selected image encoders.

2) BLIP EXPERIMENTS
BLIP's paper [23] proposed a variant of the noisy-student technique [28], called ''CapFilt'', to filter out bad captions that do not match their images. BLIP has two training phases: the pretraining phase and the finetuning phase. In the pretraining phase, BLIP was trained to perform three tasks: learning a similar embedding space for both images and texts, predicting whether an image matches a given caption, and generating a caption for a given image. The authors trained four versions of ViT in BLIP: ViT-Base (ViT-B) trained on 14M images, ViT-B trained on 129M images, ViT-B with CapFilt trained on 129M images, and ViT-Large (ViT-L) trained on 129M images. In the finetuning phase, the authors finetuned BLIP on different downstream tasks such as VQA and image captioning (IC). In our work, we explored different BLIP models with different configurations, i.e., pretrained and finetuned models with different ViT sizes and different input sizes. We found that BLIP with ViT-L finetuned on IC with an input size of 480 × 480 achieved the best accuracy, and we selected it as the image encoder module for our SBVQA 2.0 model. Table 4 shows the results of SBVQA 2.0 models after training on different versions of BLIP. The results of this experiment also show that it is possible to transfer knowledge from pretrained text-based models to speech-based models, as demonstrated by the high performance of the SBVQA 2.0 model when using the version of BLIP finetuned on the IC task as an image encoder.

C. FEATURES FUSOR SELECTION
We used two features fusors: top-down attention [24] and Hadamard (element-wise) multiplication. The choice of features fusor depends mainly on the number of feature vectors produced by the image encoder: some image encoders produce a feature matrix and others produce a single feature vector. We applied top-down attention to the feature matrices generated by the Faster R-CNN [19], DETR [21], RegionCLIP [22], and hybrid DETR-CLIP image encoders. On the other hand, we used Hadamard multiplication with the feature vectors produced by the CLIP [20] and BLIP [23] image encoders. For all of our experiments, we used a single unidirectional Gated Recurrent Unit (GRU) cell with an input size of 512 and an output size of 1024 to summarize the feature matrix of the spoken question into a single vector.
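The following is a minimal sketch of the Hadamard fusor path in PyTorch. The GRU sizes follow the text; the linear projection of the image vector to the GRU output width is our own assumption to make the element-wise product well-defined, since the feature width depends on the chosen image encoder.

import torch
import torch.nn as nn

class HadamardFusor(nn.Module):
    def __init__(self, speech_dim=512, hidden_dim=1024, image_dim=1024):
        super().__init__()
        # A single unidirectional GRU summarizes the (T, 512) speech matrix.
        self.gru = nn.GRU(input_size=speech_dim, hidden_size=hidden_dim,
                          batch_first=True)
        # Assumed projection of the image vector to the GRU output width.
        self.img_proj = nn.Linear(image_dim, hidden_dim)

    def forward(self, speech_feats, image_vec):
        """speech_feats: (B, T, 512); image_vec: (B, image_dim)."""
        _, h_n = self.gru(speech_feats)   # h_n: (1, B, 1024) final hidden state
        q = h_n.squeeze(0)                # (B, 1024) spoken-question summary
        v = self.img_proj(image_vec)      # (B, 1024) projected image vector
        return q * v                      # Hadamard (element-wise) fusion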

D. ANSWER GENERATOR SELECTION
We treated the SBVQA task as a classification problem in which the SBVQA 2.0 model should predict the correct answer from a set of predefined answers. The classes were obtained by filtering out the answers that appeared fewer than 9 times in the combined training and validation datasets, leaving a total of 3129 classes. We constructed the answer generator as a simple feed-forward classifier consisting of a single hidden layer with 2048 neurons and an output layer with 3129 neurons.
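A minimal sketch of this classifier head follows; the choice of ReLU as the hidden activation is an assumption, as the text does not specify the nonlinearity.

import torch.nn as nn

answer_generator = nn.Sequential(
    nn.Linear(1024, 2048),   # fused feature vector -> single hidden layer
    nn.ReLU(),               # assumed nonlinearity
    nn.Linear(2048, 3129),   # one logit per candidate answer class
)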

V. EVALUATION
We conducted multiple experiments with our final architecture shown in Figure 1. To help answer our research questions, we implemented the SBVQA 1.0 [5] model and trained it on both the SBVQA 1.0 and SBVQA 2.0 datasets. We set the number of output classes to 1000 in our SBVQA 1.0 implementation to match the original paper's implementation. Moreover, we built both the val2014 and test2015 datasets using the original TVLT speaker for a fair comparison with the TVLT-VQA model. We used the official TVLT-VQA model and checkpoint, released by the authors, in all of our comparisons.

A. RQ1: DOES THE MODEL UTILIZE BOTH VISUAL AND SPEECH INFORMATION?
Since the Hadamard features fusor multiplies the outputs of the speech encoder and the image encoder to produce the final feature vector, we replaced the output of the image encoder with a vector of ones. This forces the model to be blind, which helps assess the contribution of the image encoder to the whole model. We repeated the same process with the speech encoder to force the model to be deaf. We followed a similar approach to evaluate the TVLT model in both blind and deaf modes. Tables 5, 6, and 7 show the evaluation results of all models on the val2014 dataset. The results show that all models utilized their image and speech encoders to answer questions. Moreover, they show that a blind model can predict the answer better than a deaf model, which implies that the impact of the speech encoder on the overall performance of the model is larger than the impact of the image encoder.
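The sketch below illustrates the blind/deaf ablation: because the fusor is an element-wise product, replacing one modality's vector with a vector of ones cancels its contribution. Function and argument names are illustrative.

import torch

def fuse(q_vec: torch.Tensor, v_vec: torch.Tensor, mode: str = "full") -> torch.Tensor:
    if mode == "blind":                  # ignore the image encoder
        v_vec = torch.ones_like(v_vec)
    elif mode == "deaf":                 # ignore the speech encoder
        q_vec = torch.ones_like(q_vec)
    return q_vec * v_vec                 # Hadamard fusion as in the model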
To see the effect of the SBVQA 2.0 dataset on reducing the language priors of the SBVQA 1.0 dataset, we trained two SBVQA 1.0 models, one on each dataset. We can see from Figure 5 that training the SBVQA 1.0 model on the SBVQA 2.0 dataset has a significant positive impact on the model's ability to answer questions by incorporating information from both the image and the question. On the other hand, the results of the SBVQA 1.0 model trained on the SBVQA 1.0 dataset show the influence of the language priors inherited from the VQA 1.0 dataset.
Appendix B shows visualizations of some sample images after asking different questions about each image. We used Grad-CAM [29] to visualize the image regions that the SBVQA 2.0 model looked at when answering a question. We also show visualizations of the self-attention heads for the same images in Appendix C.

B. RQ2: IS THE SPEECH ENCODER SPEAKER-INDEPENDENT?
For the model to be deployed in production, it must generalize to speakers other than those it was trained on. To verify this, we evaluated the performance of the SBVQA 1.0, SBVQA 2.0, and TVLT models on unseen speakers. Figure 3 and Figure 4 show the evaluation results of all models on the multi-speaker dataset. We can see from both figures that the SBVQA 2.0 model scored almost the same accuracy across all speakers, which indicates that it is speaker-independent. In contrast, the SBVQA 1.0 model had a dramatic drop in accuracy on speakers other than Joanna. Likewise, TVLT performed poorly on other speakers, although it was pretrained on large datasets with various speakers. These results indicate that both the SBVQA 1.0 and TVLT models are speaker-dependent.

C. RQ3: IS THE SPEECH ENCODER NOISE-ROBUST?
Different recording environments can affect the performance of any speech-based model; robust models have higher resistance to various background noises. To assess the robustness of the models, we created 5 noisy variants of the multi-speaker dataset with 5 different SNR levels, as described in Section III. We measured the performance of the SBVQA 1.0, SBVQA 2.0, and TVLT models on each noisy dataset and compared it with their performance on the original clean data. Figure 3 shows that the SBVQA 2.0 model is robust against different background noises, maintaining similar performance even at lower SNR levels. In comparison, the SBVQA 1.0 and TVLT models performed poorly across all SNR levels, as shown in Figure 3 and Figure 4, indicating their lack of robustness to noise.

D. RQ4: IS THE E2E FINETUNING EFFECTIVE?
Since enhancing the performance of any component of the SBVQA 2.0 model would enhance the overall system performance, we selected the image encoder module to experiment with. The idea was to unfreeze the parameters of the image encoder module and jointly finetune it with the whole pipeline. In the first experiment, we unfroze all parameters of the image encoder and finetuned the model. Surprisingly, finetuning the whole image encoder resulted in a drastic drop in performance. We conducted another experiment in which we changed the learning rate from $2\times10^{-3}$ to $3\times10^{-6}$ and froze the whole image encoder except the last layer. Our intuition was that the ViT model has learned common features in most of its layers during pretraining, and finetuning them all could potentially damage the learned features [30], [31], [32]. Furthermore, updating the model with a high learning rate could lead to forgetting the original data, which may cause a degradation in the overall performance [33], [34]. We finetuned the SBVQA 2.0 model with the new settings for 6 epochs, and the accuracy of the model increased by 0.34%. This result indicates that E2E finetuning is effective and can improve the overall performance of the model. Table 8 shows the results of these experiments.
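The sketch below illustrates the setup of the second experiment in PyTorch: freeze the whole image encoder except its last layer and train with the reduced learning rate. The blocks attribute for the encoder layers and the choice of Adam as the optimizer are our own assumptions, as the text does not specify them.

import torch

def configure_finetuning(model, lr=3e-6):
    for p in model.image_encoder.parameters():
        p.requires_grad = False                      # freeze the image encoder...
    for p in model.image_encoder.blocks[-1].parameters():
        p.requires_grad = True                       # ...except its last layer
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)        # assumed optimizer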

VI. STATE-OF-THE-ART COMPARISON
We tested the final SBVQA 2.0 model, as depicted in Figure 1, on the official test sets of the SBVQA 2.0 dataset. We used the best checkpoint after finetuning the model, along with the image encoder that yielded the best performance on the val2014 dataset. We also evaluated the best checkpoint of the SBVQA 1.0 model that we trained on the SBVQA 2.0 dataset. For the TVLT [7] model evaluation, we used the original TVLT-VQA checkpoint. Our model achieved the best result on both the test-dev 2015 and test-std 2015 datasets. Table 9 and Table 10 show the results of the final SBVQA 2.0 model on all evaluation datasets compared to the existing models.

VII. CONCLUSION
In this paper, we proposed a new SBVQA architecture, called SBVQA 2.0, that addresses the issues of the SBVQA 1.0 model. We also developed the SBVQA 2.0 dataset based on the previous SBVQA 1.0 and VQA 2.0 datasets to help mitigate the issue of the language priors inherited by the SBVQA 1.0 dataset. We thoroughly discussed the SBVQA 2.0 architecture and showed that it is speaker-independent, noise-robust, and can be finetuned in an end-to-end manner for better performance. We also showed that it is possible to transfer knowledge from text-based models to speech-based models by using the image encoder from BLIP that was finetuned on the image captioning task. Based on our preliminary analysis, we observed that a blind model can predict the answer better than a deaf model. This finding raised new research questions: 1) Are errors more likely to be caused by the speech encoder or the image encoder? 2) Are there inherent biases present in the SBVQA 2.0 dataset that contribute to the overall error of the model?
These questions can be explored in future work, as they hold significant implications for understanding the underlying factors contributing to errors in SBVQA systems. With the SBVQA 2.0 model and dataset, we set a new baseline accuracy for the SBVQA task, and we hope that our work will motivate other researchers to create novel models and advance research in this area.

FIGURE 2. The types of speech encoders.

FIGURE 3. Evaluation results of SBVQA 1.0 and SBVQA 2.0 models on the multi-speaker dataset with different levels of noise. Both models were trained on the SBVQA 2.0 dataset.

FIGURE 4. Evaluation results of TVLT [7] model on the multi-speaker dataset with different levels of noise.

FIGURE 5. Evaluation results of the SBVQA 1.0 model after training on the SBVQA 1.0 and SBVQA 2.0 datasets. The evaluation was done on the val2014 split of each dataset.

FIGURE 6. UMAP projection of the utterance embeddings on a 2D space using different versions of WavLM model.


FIGURE 7. UMAP projection of the utterance embeddings on a 2D space using different versions of UniSpeech model.

FIGURE 8. UMAP projection of the utterance embeddings on a 2D space using Conformer-L model.

FIGURE 9. Examples of different questions about each image, showing the regions that contributed the most to answering the question. For the same image, we can get different responses from the image encoder based on the given question, which shows that the question is a major contributor in directing the focus of the image encoder to the best region. Sometimes the question can be interpreted in different ways, which can result in an unexpected but plausible answer. For example, in Question #2 of Image #4, the model was expected to give an answer about the boat's location with respect to the image (on the right side), but instead it answered that the boat is on the water, focusing on the background (water) rather than the intended context. This answer is plausible but not correct. Furthermore, there are cases where the image encoder focuses on an incorrect region, resulting in a wrong answer. For example, in Question #1 of Image #6, the model focused on the person in the middle when it was supposed to focus on the person on the right; the answer was wrong due to the wrong region that the image encoder was looking at while answering.


FIGURE 10. Visualization of the self-attention heads of the last layer of the image encoder used in the SBVQA 2.0 model after finetuning on the SBVQA 2.0 dataset. We can see that different heads have learned different concepts from the image. For example, Head 3 and Head 10 have learned to focus on salient objects in the image, while Head 13 and Head 16 have learned to focus on the background of the image. Learning different concepts helps in answering various questions about the given image.


Mean Silhouette score of each speech encoder with a 99% confidence interval computed over 100 unique questions.

Evaluation results of SBVQA 2.0 models on val2014 dataset using different image encoders.

Evaluation results of SBVQA 2.0 models on val2014 dataset using different versions of BLIP.

Evaluation results of SBVQA 1.0 model on val2014 dataset.

Evaluation results of SBVQA 2.0 model on val2014 dataset.

Evaluation results of TVLT model on val2014 dataset.

Evaluation results of SBVQA 2.0 model on val2014 dataset after unfreezing the image encoder module.

Evaluation results on test-dev 2015 dataset.

Evaluation results on test-std 2015 dataset.