
Variational Causal Inference Network for Explanatory Visual Question Answering


Abstract:

Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task that requires answering visual questions and generating multimodal explanations for the reasoning processes. Unlike traditional Visual Question Answering (VQA), which focuses solely on answering, EVQA aims to provide user-friendly explanations that enhance the explainability and credibility of reasoning models. However, existing EVQA methods typically predict the answer and explanation separately, ignoring the causal correlation between them. Moreover, they neglect the complex relationships among question words, visual regions, and explanation tokens. To address these issues, we propose a Variational Causal Inference Network (VCIN) that establishes the causal correlation between predicted answers and explanations, and captures cross-modal relationships to generate rational explanations. First, we utilize a vision-and-language pretrained model to extract visual and question features. Second, we propose a multimodal explanation gating transformer that constructs cross-modal relationships and generates rational explanations. Finally, we propose a variational causal inference module that establishes the target causal structure and predicts the answers. Comprehensive experiments demonstrate the superiority of VCIN over state-of-the-art EVQA methods.
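
To make the three-stage pipeline described in the abstract concrete, the sketch below shows one plausible way to wire the components together in PyTorch. The module names, feature dimensions, gating formulation, and the way the answer is conditioned on a latent sampled from the explanation states are illustrative assumptions, not the authors' implementation; the pretrained vision-and-language encoder of the first stage is assumed to have produced vis_feats and q_feats upstream.

import torch
import torch.nn as nn

class GatedExplanationLayer(nn.Module):
    """One cross-modal layer: explanation tokens attend over the fused
    question/visual context, and a sigmoid gate controls the update
    (assumed design, not the paper's exact architecture)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, expl_tokens, context):
        attended, _ = self.cross_attn(expl_tokens, context, context)
        g = self.gate(torch.cat([expl_tokens, attended], dim=-1))
        return expl_tokens + g * attended  # gated residual update

class VCINSketch(nn.Module):
    """Illustrative pipeline: (1) features from a pretrained vision-language
    model enter as vis_feats/q_feats, (2) a gated cross-modal layer produces
    explanation states, (3) a variational head samples a latent from those
    states and predicts the answer conditioned on it."""
    def __init__(self, dim: int = 768, vocab_size: int = 30522, num_answers: int = 3129):
        super().__init__()
        self.expl_layer = GatedExplanationLayer(dim)
        self.expl_head = nn.Linear(dim, vocab_size)       # explanation token logits
        self.to_mu = nn.Linear(dim, dim)                  # posterior mean
        self.to_logvar = nn.Linear(dim, dim)              # posterior log-variance
        self.answer_head = nn.Linear(2 * dim, num_answers)

    def forward(self, vis_feats, q_feats, expl_tokens):
        # Stage 2: cross-modal gating over the fused visual/question context.
        context = torch.cat([vis_feats, q_feats], dim=1)
        expl_states = self.expl_layer(expl_tokens, context)
        expl_logits = self.expl_head(expl_states)
        # Stage 3: sample a latent summarizing the explanation and condition
        # the answer prediction on it, linking the two outputs.
        summary = expl_states.mean(dim=1)
        mu, logvar = self.to_mu(summary), self.to_logvar(summary)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        answer_logits = self.answer_head(torch.cat([q_feats.mean(dim=1), z], dim=-1))
        return answer_logits, expl_logits, (mu, logvar)

In this reading, the latent z is what ties the answer to the explanation; training such a head would typically combine an answer loss, an explanation loss, and a KL regularizer on (mu, logvar), though the actual objective used by VCIN is the one defined in the paper.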
Date of Conference: 01-06 October 2023
Date Added to IEEE Xplore: 15 January 2024
Conference Location: Paris, France


1. Introduction

Multimodal reasoning is a vital ability for humans and a fundamental problem for artificial intelligence [27], [39], [8]. Despite the promising performance of deep neural networks on various multimodal reasoning tasks [35], [37], [47], [34], [36], existing models typically generate reasoning results without explaining the rationale behind them. Consequently, the low explainability of the generated results severely reduces the credibility of reasoning models and restricts their application. To address this issue, Chen and Zhao [11] recently proposed the Explanatory Visual Question Answering (EVQA) task, which expands upon Visual Question Answering (VQA) [5], [15] by requiring multimodal reasoning explanations. As shown in Figure 1, while traditional VQA aims to answer a question about a given image, EVQA goes further by demanding an explanation of the reasoning process. This extension creates the possibility for improved explainability and credibility of reasoning models.
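
For readers coming from standard VQA, the difference in the task interface can be summarized in a few lines. The field names below are hypothetical and only illustrate that an EVQA prediction carries a rationale alongside the answer; they are not drawn from the paper or any dataset.

from dataclasses import dataclass, field
from typing import List

@dataclass
class VQAPrediction:
    answer: str                       # e.g. "yes"

@dataclass
class EVQAPrediction(VQAPrediction):
    explanation_text: str = ""        # natural-language rationale for the answer
    cited_regions: List[int] = field(default_factory=list)  # image regions referenced by the explanation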
