
Relation Inference Enhancement Network for Visual Commonsense Reasoning


Abstract:

When presented with a question regarding an image, Visual Commonsense Reasoning (VCR) requires not only a correct answer but also a rationale to justify that answer. Existing methods simply project features from multiple modalities into a shared feature space, which does not align with human reasoning patterns and results in inadequate cross-modal and intra-modal reasoning. On the one hand, inadequate cross-modal reasoning arises because existing models rely on semantic correlations between answers and rationales, both of which are textual, rather than on the generative process by which humans reason from the visual to the textual modality. On the other hand, inadequate intra-modal reasoning arises from the inability of existing models to leverage previously acquired object relations beyond the current observation, as humans do. To this end, we propose a novel Relation Inference Enhancement Network (RIE-Net), which enhances reasoning ability through cross-modal image analysis and introduces intra-modal relational reasoning modules to memorize reasoning knowledge. To strengthen the cross-modal association between images and rationales, RIE-Net introduces a cross-modal image analysis module, which eliminates language bias between answers and rationales by generating rationales directly from images. In addition, to comprehend and retain relational knowledge, RIE-Net introduces intra-modal relational reasoning modules that capture prior knowledge associated with various object categories and enhance the model's understanding of visual-spatial relationships. Quantitative and qualitative evaluations on the public VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.
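The abstract describes the two modules only at a high level, and no implementation details appear on this page. As a rough illustration of the stated ideas, here is a minimal PyTorch sketch: a decoder that produces rationale tokens conditioned only on image region features (the cross-modal "rationale from image" idea), and a self-attention block whose scores are biased by a learned per-category-pair prior (the intra-modal "memorized object relations" idea). All module names, dimensions, and architectural choices below are assumptions for illustration, not the authors' design.

```python
# Hypothetical sketch, not the RIE-Net authors' code.
import torch
import torch.nn as nn

class CrossModalRationaleGenerator(nn.Module):
    """Predicts rationale token logits from image region features alone,
    bypassing answer text to reduce answer-rationale language bias."""
    def __init__(self, vis_dim=2048, hid_dim=512, vocab_size=30522, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(vis_dim, hid_dim)  # map regions into decoder space
        layer = nn.TransformerDecoderLayer(hid_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, region_feats, rationale_embeds):
        # region_feats: (B, R, vis_dim); rationale_embeds: (B, T, hid_dim)
        memory = self.proj(region_feats)
        dec = self.decoder(rationale_embeds, memory)  # attends to visual memory only
        return self.out(dec)                          # (B, T, vocab_size) token logits

class IntraModalRelationalReasoning(nn.Module):
    """Self-attention over object features whose attention scores are biased by
    a learned prior for each (category_i, category_j) pair, standing in for
    'memorized' relational knowledge about object categories."""
    def __init__(self, dim=512, n_categories=80):
        super().__init__()
        self.n_cat = n_categories
        self.qkv = nn.Linear(dim, dim * 3)
        self.prior = nn.Embedding(n_categories * n_categories, 1)  # pairwise prior

    def forward(self, obj_feats, obj_cats):
        # obj_feats: (B, N, dim); obj_cats: (B, N) integer category ids
        B, N, D = obj_feats.shape
        q, k, v = self.qkv(obj_feats).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / D ** 0.5        # (B, N, N) content scores
        pair_ids = obj_cats.unsqueeze(2) * self.n_cat + obj_cats.unsqueeze(1)
        attn = attn + self.prior(pair_ids).squeeze(-1)     # add category-pair prior
        return attn.softmax(-1) @ v                        # (B, N, dim)
```

Under this reading, the first module is trained with a generative objective on image-rationale pairs, while the second injects category-level priors so relational reasoning is not limited to the current image; how the two are actually combined is specified in the paper itself.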
Published in: IEEE Transactions on Multimedia (Early Access)
Page(s): 1 - 11
Date of Publication: 24 December 2024