Abstract:
This paper introduces a groundbreaking enhancement to image captioning through a unique approach that harnesses the combined power of the Vision Encoder-Decoder model. By leveraging the Swin Transformer as the image encoder and RoBERTa as the text decoder, this architecture seamlessly integrates cutting-edge technologies in computer vision, natural language processing, and big data. To demonstrate the efficacy of this methodology, a rigorous evaluation was conducted on the Flickr30k dataset, allowing a comprehensive examination and comparison of the model's performance against current state-of-the-art methods. Furthermore, the inclusion of the Flickr30k dataset expands the evaluation's reach and addresses the ever-changing demands of image captioning challenges. By broadening the analysis to include Flickr30k, this study makes a valuable contribution to our understanding of the model's ability to adapt to diverse datasets, making it more suitable for real-world applications. The results of the evaluation on Flickr30k showcase the model's versatility and effectiveness in generating accurate and contextually relevant captions. By combining the power of the Swin Transformer and RoBERTa, the proposed architecture not only outperforms established benchmarks but also solidifies its position as a robust solution for tackling the complexities inherent in image captioning challenges.
Date of Conference: 12-14 May 2024
Date Added to IEEE Xplore: 11 July 2024