Abstract:
This paper presents a novel methodology for the extraction and retrieval of images in RAG (Retrieval Augmented Generation) powered question-answering conversational systems that circumvents the limitations of traditional image retrieval approaches based on Optical Character Recognition and Large Language Models (OCR-LLM). We leverage the positional information of images across a wide range of multimodal (text/image) documents to ingest image information alongside text, and combine this with advanced retrieval and prompt-engineering techniques to build a RAG system that preserves the correlation between textual and visual data in responses to queries involving both text and images, and that can retrieve both OCR-compatible and OCR-incompatible images. We have applied this approach to a variety of multimodal documents, ranging from research papers, application documentation, and surveys to guides and manuals containing text, images, and even tables with embedded images, and achieved state-of-the-art (SoTA) performance on simple to complex queries over these documents. Furthermore, our approach performs markedly better in cases where vision models such as GPT-4 Vision fail to accurately retrieve OCR-incompatible images depicting highly customized scientific devices or diagrams, and in cases where an image's visual representation is not semantically aligned with the surrounding text but must nevertheless be retrieved for a complete response.
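A minimal sketch of the core idea described in the abstract (ingesting images by their position within the text so that retrieval returns text and its co-located images together) is shown below. This is not the authors' code: the chunking granularity, the Chunk schema, and the retriever/LLM interfaces (search, generate) are assumptions made purely for illustration.

```python
# Illustrative sketch, not the paper's implementation: keep each image
# attached to the text chunk covering its original position, so the
# text-image correlation survives retrieval and can be surfaced in answers.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_id: str
    page: int
    text: str
    image_refs: list = field(default_factory=list)  # images positioned inside this chunk

def ingest(doc_id, pages):
    """pages: list of (text, [(image_path, char_offset), ...]) per page (assumed input shape)."""
    chunks = []
    for page_no, (text, images) in enumerate(pages, start=1):
        refs = [path for path, _offset in images]  # record positional association, not OCR content
        chunks.append(Chunk(doc_id, page_no, text, refs))
    return chunks

def answer(query, retriever, llm):
    # retriever.search and llm.generate are assumed interfaces for this sketch.
    hits = retriever.search(query, k=4)
    context = "\n\n".join(c.text for c in hits)
    images = [ref for c in hits for ref in c.image_refs]  # images co-located with retrieved text
    prompt = (
        "Answer using only the context. If an image reference is relevant, cite it by path.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.generate(prompt), images
```

Because the image is bound to the chunk by position rather than by OCR text or visual embedding, this style of pipeline can return OCR-incompatible images (e.g., diagrams of customized scientific devices) whenever the surrounding text is retrieved.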
Date of Conference: 25-27 July 2024
Date Added to IEEE Xplore: 08 October 2024