Policy Learning-Based Image Captioning With Vision Transformer


Abstract:

Despite remarkable progress in automated image caption generation, it remains challenging to create captions that accurately reflect factual information while capturing the nuances of human language. This work aims to produce captions that accurately describe both the visual content and the complexity and indirectness of human emotion. The existing approach fuses a CNN's capacity to comprehend the visual elements within an image with an RNN's ability to generate sequential language structures tailored to various visual contexts. This paper combines Vision Transformers (ViTs) for image understanding, pre-trained language models for language fluency and nuance, and fact-checking mechanisms to ensure factual accuracy. Attention mechanisms and diversity checks improve the overall quality of the generated captions. Reinforcement learning is used to fine-tune the model's performance iteratively.
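
The abstract does not spell out the policy-learning objective, so the following is only a minimal sketch of one common choice, a REINFORCE-style loss with a greedy-decoding baseline (often called self-critical training). It is an assumption for illustration, not the authors' confirmed method. The function names, the toy unigram-overlap reward, and the example captions and probabilities are all hypothetical.

```python
import math


def unigram_overlap_reward(candidate, reference):
    """Hypothetical reward: fraction of candidate tokens found in the reference."""
    if not candidate:
        return 0.0
    reference_tokens = set(reference)
    return sum(tok in reference_tokens for tok in candidate) / len(candidate)


def reinforce_loss(token_log_probs, reward, baseline):
    """Surrogate loss for one sampled caption: -(reward - baseline) * sum of log-probs.

    Minimizing it raises the likelihood of captions that beat the baseline
    and lowers the likelihood of captions that fall below it.
    """
    advantage = reward - baseline
    return -advantage * sum(token_log_probs)


# Toy data (assumed): a reference caption, a sampled caption, a greedy caption.
reference = ["a", "dog", "runs", "on", "grass"]
sampled = ["a", "dog", "sits", "on", "grass"]
greedy = ["a", "cat", "sits", "on", "sand"]

# Assumed per-token probabilities the captioning policy gave the sampled caption.
sampled_log_probs = [math.log(p) for p in (0.9, 0.6, 0.2, 0.8, 0.7)]

r_sampled = unigram_overlap_reward(sampled, reference)
r_greedy = unigram_overlap_reward(greedy, reference)  # serves as the baseline
loss = reinforce_loss(sampled_log_probs, r_sampled, r_greedy)
print(r_sampled, r_greedy, loss)
```

In a real system the reward would typically be a caption metric and the log-probabilities would come from the decoder, with the loss backpropagated through an automatic-differentiation framework; none of those details are given in the abstract.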
Date of Conference: 24-25 February 2024
Date Added to IEEE Xplore: 02 April 2024
Conference Location: Bhopal, India
