Abstract:
Despite the remarkable progress in automated image caption generation, it remains challenging to create captions that accurately reflect factual information while capturing the nuances of human language. This work aims to produce captions that faithfully describe the visual content while conveying the complexity and indirectness of human emotion. Existing models fuse a CNN's capacity to comprehend the visual elements of an image with an RNN's ability to craft sequential language structures tailored to various visual contexts. This paper instead combines Vision Transformers (ViTs) for image understanding, pre-trained language models for fluency and nuance, and fact-checking mechanisms to ensure factual accuracy. Attention mechanisms and diversity checks further improve the quality of the generated captions, and reinforcement learning is used to fine-tune the model's performance iteratively.
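The abstract describes pairing a ViT image encoder with a pre-trained language-model decoder and applying diversity checks over candidate captions. As an illustrative sketch of that general pattern only (not the authors' implementation), the following Python snippet uses the public nlpconnect/vit-gpt2-image-captioning checkpoint from Hugging Face Transformers; the checkpoint choice, input file name, and decoding parameters are assumptions for demonstration:

# Illustrative sketch: ViT encoder + GPT-2 decoder for captioning.
# Checkpoint, image path, and generation settings are assumptions,
# not the model proposed in this paper.
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

ckpt = "nlpconnect/vit-gpt2-image-captioning"  # public ViT+GPT-2 checkpoint
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
processor = ViTImageProcessor.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Beam search returning several sequences yields candidate captions;
# a downstream diversity or fact check could rank or filter these.
with torch.no_grad():
    output_ids = model.generate(
        pixel_values,
        max_length=32,
        num_beams=4,
        num_return_sequences=4,
    )
candidates = [tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
for caption in candidates:
    print(caption)

Generating multiple candidates per image, as above, is one simple way to support the diversity checks the abstract mentions before a final caption is selected.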
Published in: 2024 IEEE International Students' Conference on Electrical, Electronics and Computer Science (SCEECS)
Date of Conference: 24-25 February 2024
Date Added to IEEE Xplore: 02 April 2024