Abstract:
Automatic image description has attracted many researchers in computer vision, since image captioning connects that field with natural language processing. Accurate caption generation is necessary but is hindered by the vanishing gradient problem. An LSTM can overcome this problem by fusing local and global characteristics of the image and the text, producing sequential word predictions for accurate image captioning. We use the Flickr8k dataset, which pairs images with text descriptions. GloVe embeddings supply the word representation used alongside the global and local features of images, with Euclidean distance capturing the relationship between words in vector space. An Inception V3 architecture pretrained on ImageNet extracts image features for the different objects in a scene. We propose a Linear Sub-Structure that helps generate the words of a caption in sequence by modeling the relationships between words. Image feature extraction takes covariance shift into account, concentrating mainly on the moving parts of the image, so that the generated description is accurate and the predicted caption maintains a semantic visual grammar relationship with the image. The proposed model is evaluated with the BLEU score and, compared with other models, achieves state-of-the-art results with an accuracy greater than 81%.
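To make the evaluation metric concrete, the following is a minimal sketch of a sentence-level BLEU score in plain Python. The abstract does not state which BLEU variant, n-gram order, or smoothing was used, so the uniform weights, the default `max_n=4`, the absence of smoothing, and the helper names `ngrams` and `bleu` are all assumptions made for illustration; this is not the authors' evaluation code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights, no smoothing (assumed variant)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # no overlap at this order; unsmoothed BLEU is zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical captions, not taken from Flickr8k.
    predicted = "a dog runs across the green grass".split()
    human = ["a dog runs across the grass".split(),
             "a brown dog is running on the grass".split()]
    print(f"BLEU-4: {bleu(predicted, human):.3f}")
```

Because the score is a geometric mean of n-gram precisions, a caption that shares no 4-gram with any reference scores zero under this unsmoothed form, which is one reason reported BLEU values depend strongly on the n-gram order chosen.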
Date of Conference: 26-27 November 2021
Date Added to IEEE Xplore: 29 December 2021