Impact Statement:
Visual reasoning plays an essential role in understanding, interpreting, and reasoning about visual content, such as images and videos. It enables a wide range of applications of artificial intelligence, including autonomous systems, semantic image search, and assistive technologies. Scene Graph Generation (SGG), a key component in this process, offers semantically rich image representations that are fundamental for visual reasoning. However, its reliance on data-centric methods leads to challenges from imbalanced datasets and limited relational scope, particularly in zero-shot SGG. The proposed method, leveraging common sense knowledge, significantly enhances the generalisability of SGG. It notably boosts zero-shot recall rate by 59.96% on the standard benchmark and demonstrates cross-dataset generalisability. This advancement facilitates more accurate and intuitive visual reasoning and encourages further research on knowledge-based approaches for generalised SGG to extend and enhance...
Abstract:
A scene graph is a key image representation in visual reasoning. The generalisability of Scene Graph Generation (SGG) methods is crucial for reliable reasoning and real-world applicability. However, imbalanced training datasets, which underrepresent many meaningful visual relationships, limit this generalisability. Current SGG methods using external knowledge sources face limitations due to these imbalances or restricted relationship coverage, impacting their reasoning and generalisation capabilities. We propose a novel neurosymbolic approach that integrates data-driven object detection with heterogeneous knowledge graph-based object refinement and zero-shot relationship retrieval, highlighting the loosely coupled synergy between neural and symbolic components. This combination addresses the limitations of imbalanced training datasets in scene graph generation and enables effective prediction of unseen visual relationships. Objects are detected using a region-based deep neural network and refined based on their...
Published in: IEEE Transactions on Artificial Intelligence (Early Access)