I. Introduction
Visual Question Answering (VQA) refers to the task of answering natural language questions about a given image. Although notable progress has been made, existing VQA benchmarks primarily focus on simple recognition questions (e.g., how many or what color) while neglecting explanations for the predicted answers. In light of this, the task of Visual Commonsense Reasoning (VCR) [1], [2] has recently been introduced to bridge this gap. Unlike traditional VQA, which focuses on answering visual questions (Q→A), VCR goes a step further by requiring the model to select the rationale behind the Q→A process, which involves capturing visual commonsense (R), denoted as QA→R.
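To make the two-stage setting concrete, the sketch below illustrates how a model is evaluated under the Q→A and QA→R subtasks. It is a minimal illustration only: the field layout and the score_answer / score_rationale scorers are hypothetical placeholders, not the interface of any particular VCR implementation.

```python
# A minimal sketch of the two-stage VCR protocol (Q->A, then QA->R).
# The scorer callables below are hypothetical placeholders standing in
# for a trained model; they are not part of the VCR benchmark itself.

from typing import Callable, List, Tuple


def vcr_predict(
    image_path: str,
    question: str,
    answer_choices: List[str],      # candidate answers (A), 4 per question
    rationale_choices: List[str],   # candidate rationales (R), 4 per question
    score_answer: Callable[[str, str, str], float],
    score_rationale: Callable[[str, str, str, str], float],
) -> Tuple[str, str]:
    # Stage 1 (Q->A): pick the answer that best fits the image and question.
    best_answer = max(
        answer_choices,
        key=lambda a: score_answer(image_path, question, a),
    )
    # Stage 2 (QA->R): given the chosen answer, pick the rationale that best
    # explains *why* that answer holds, i.e., the visual commonsense behind it.
    best_rationale = max(
        rationale_choices,
        key=lambda r: score_rationale(image_path, question, best_answer, r),
    )
    return best_answer, best_rationale
```

Under this protocol, a rationale prediction is typically counted as correct only when the answer prediction is also correct, which is why QA→R is strictly harder than Q→A alone.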