1. Introduction
Multimodal reasoning is a vital ability for humans and a fundamental problem for artificial intelligence [27], [39], [8]. Despite the promising performance of deep neural networks on various multimodal reasoning tasks [35], [37], [47], [34], [36], existing models typically generate reasoning results without explaining the rationale behind them. This lack of explainability severely undermines the credibility of reasoning models and restricts their application. To address this issue, Chen and Zhao [11] recently proposed the Explanatory Visual Question Answering (EVQA) task, which extends Visual Question Answering (VQA) [5], [15] by requiring explanations of the multimodal reasoning process. As shown in Figure 1, while traditional VQA aims only to answer a question about a related image, EVQA goes further by demanding an explanation of the reasoning that leads to the answer. This extension paves the way for more explainable and credible reasoning models.