I. Introduction
This project aims to create an automated image annotator with voice synthesis using the CNN, LSTM and Python, which can generate easily understandable descriptions of images. This technology has become increasingly popular in recent years. This model first extracts the features from the given image by using CNN and then the LSTM architecture analyzes those features to predict the correct textual output description. Finally, it generates a text description of the image and its features, which is converted into speech via voice synthesis.