Abstract:
Given a question about an image, Visual Commonsense Reasoning (VCR) requires providing not only a correct answer but also a rationale that justifies the answer. VCR is a challenging task because it demands proper semantic alignment and reasoning between the image and the linguistic expression. Recent approaches show great promise by exploring holistic attention mechanisms or graph-based networks, but most of them perform implicit reasoning and ignore the semantic dependencies within the linguistic expression. In this paper, we propose a novel explicit cross-modal representation learning network for VCR that incorporates syntactic information into visual reasoning and natural language understanding. The proposed method enjoys several merits. First, based on a two-branch neural module network, it performs explicit cross-modal reasoning guided by the high-level syntactic structure of the linguistic expression. Second, the semantic structure of the linguistic expression is incorporated into a syntactic GCN to facilitate language understanding. Third, our explicit cross-modal representation learning network provides a traceable reasoning flow, which offers visible fine-grained evidence for the answer and the rationale. Quantitative and qualitative evaluations on the public VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.
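To make the syntactic GCN idea concrete, the following is a minimal sketch in PyTorch of a graph convolution over dependency-parse edges. This is an illustrative assumption, not the paper's exact design: the class name SyntacticGCNLayer, the single shared weight matrix, and the simple degree normalization are all hypothetical simplifications (published syntactic GCNs often use edge-label-specific transforms and gating).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntacticGCNLayer(nn.Module):
    """One graph-convolution layer over a dependency parse (illustrative sketch).

    A single shared weight matrix is used here for simplicity; the paper's
    syntactic GCN may condition on edge labels and direction.
    """
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # h:   (num_tokens, dim) token representations
        # adj: (num_tokens, num_tokens) dependency adjacency,
        #      symmetrized and with self-loops added
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)  # degree normalization
        msg = adj @ self.linear(h) / deg                    # aggregate neighbor messages
        return F.relu(msg)

# Toy usage: 5 tokens, 16-dim features, a small dependency tree.
tokens = torch.randn(5, 16)
adj = torch.eye(5)                                   # self-loops
for head, dep in [(1, 0), (1, 2), (2, 4), (4, 3)]:   # (head, dependent) arcs
    adj[head, dep] = adj[dep, head] = 1.0
layer = SyntacticGCNLayer(16)
out = layer(tokens, adj)                             # (5, 16) syntax-aware token features

Stacking such layers lets each token's representation absorb information from its syntactic neighborhood, which is one plausible way the semantic structure of the linguistic expression could be injected into language understanding.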
Published in: IEEE Transactions on Multimedia (Volume: 24)