
How to Use Language Expert to Assist Inference for Visual Commonsense Reasoning



Abstract:

The Visual Commonsense Reasoning (VCR) task requires a Vision and Language Model (VLM) to capture cognition-level clues from visual-linguistic input and to give the right answers to questions together with their rationales. Although Pretrained Language Models (PLMs) have recently been used as powerful in-domain knowledge bases for various tasks such as image segmentation and visual question answering, their ability to generalize to unseen multi-modal data in an out-of-domain setting remains unexplored. In this paper, we explore how to use a PLM to assist a VLM on the challenging VCR task and propose a framework called Vision and Language Assisted with Expert Language Model (VLAELM). VLAELM employs a PLM with expert-level commonsense knowledge to assist reasoning, which is difficult for a VLM to learn from scarce multi-modal data alone. Experiments show that VLAELM achieves significant improvements over strong baselines. Moreover, we validate the credibility of the language expert as a knowledge base and assess the trade-off between generalization and specialization in PLMs.
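The abstract does not specify how the language expert's knowledge is injected into the VLM, so the following is only a minimal sketch of one plausible scheme under assumed details: late fusion of a VLM's per-answer scores with a text-only PLM's commonsense-plausibility scores over VCR's multiple-choice answers. Every name here (`vlm_scores`, `plm_scores`, the fusion weight `alpha`) is a hypothetical placeholder, not the paper's actual VLAELM implementation.

```python
# Hedged sketch: PLM-assisted answer selection for a VCR-style
# multiple-choice question. NOT the paper's VLAELM method; the scoring
# functions below are stand-ins for real model forward passes.

import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def vlm_scores(question: str, answers: List[str]) -> List[float]:
    # Placeholder: a real system would score each candidate answer with a
    # vision-language model conditioned on the image and the question.
    return [0.2, 1.5, 0.1, 0.4]


def plm_scores(question: str, answers: List[str]) -> List[float]:
    # Placeholder: a real system would use a pretrained language model's
    # likelihood of each answer as a text-only commonsense plausibility score.
    return [0.5, 1.1, 0.0, 0.9]


def fused_prediction(question: str, answers: List[str], alpha: float = 0.7) -> int:
    """Late fusion: weighted sum of VLM and PLM answer probabilities."""
    p_vlm = softmax(vlm_scores(question, answers))
    p_plm = softmax(plm_scores(question, answers))
    fused = [alpha * v + (1 - alpha) * p for v, p in zip(p_vlm, p_plm)]
    return max(range(len(answers)), key=fused.__getitem__)


if __name__ == "__main__":
    q = "Why is [person1] holding an umbrella?"
    a = [
        "It is raining.",
        "They are about to go outside into the rain.",
        "They are cooking dinner.",
        "They expect rain soon.",
    ]
    print("Predicted answer index:", fused_prediction(q, a))
```

With `alpha` near 1 the visual model dominates; lowering it lets the text-only expert override visually ambiguous cases. In a real system the fusion weight could equally be learned rather than fixed; this fixed-weight version is just the simplest illustration of the assist-at-inference idea.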
Date of Conference: 01-04 December 2023
Date Added to IEEE Xplore: 06 February 2024
Conference Location: Shanghai, China
