Automatic Image Annotation Based on Deep Learning Models: A Systematic Review and Future Challenges

Recently, much attention has been given to image annotation due to the massive increase in image data volume. Automatic image annotation (AIA) is one of the image retrieval methods that allows images to be retrieved in the same way as text. Consequently, numerous studies have been conducted on AIA, particularly on classification-based and probabilistic modeling techniques. Several image annotation techniques that perform reasonably well on standard datasets have been developed over the last decade. In this paper, a review of image annotation methods is conducted, focusing on deep learning models. AIA methods are classified into five categories: i) Convolutional Neural Network (CNN)-based AIA, ii) Recurrent Neural Network (RNN)-based AIA, iii) Deep Neural Network (DNN)-based AIA, iv) Long Short-Term Memory (LSTM)-based AIA, and v) Stacked Auto-Encoder (SAE)-based AIA. The five categories are assessed based on their principal idea, feature extraction technique, annotation accuracy, computational complexity, and the datasets examined. Moreover, the evaluation metrics used to assess AIA methods are reviewed and discussed. The review highlights the need for careful consideration of methods when developing novel techniques and datasets for the image annotation task. From the analysis of the achievements so far, it is clear that more attention should be paid to automatic image annotation.


I. INTRODUCTION
The steadily growing volume of digital images and the need to meet users' requirements for very large data volumes have necessitated accurate and efficient image retrieval technology. Automatic image annotation (AIA) is one of the image retrieval methods that allows images to be retrieved in the same way as text. According to Barnard et al. [1], AIA is an important problem in computer vision. As images often contain complex and varied kinds of content information, querying, retrieving, and organizing image information quickly and effectively becomes a crucial issue. AIA can be applied in various fields, including online/offline data exploration, image manipulation, and annotation applications on mobile devices [2]-[4]. In a typical image annotation system, two things are significant: (i) a semantic understanding of digital images and (ii) a natural language processing (NLP) unit that translates the images' semantic data into an output that a human can read. Various AIA methods have recently been proposed, giving rise to several AIA algorithms. These methods include the use of texture similarity, Support Vector Machines, Bayesian methods, and instance-based methods [1]-[6]. However, deep learning techniques have, over the last decade, performed excellently in image processing. Visual attention has also been successfully deployed with deep neural networks in many NLP and computer vision methods. Its usage for image annotation has also been reported in several studies [7]-[10]. Despite the prevailing deep learning-based methods for improving the performance of AIA frameworks, AIA still faces several key challenges.
The associate editor coordinating the review of this manuscript and approving it for publication was Tao Zhou.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Among these challenges is the requirement of a huge data volume to make accurate predictions. The other two major challenges of AIA are the management of imbalanced keyword distributions and the selection of appropriate features. Previous works on AIA have developed several deep learning approaches, such as the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and the Deep Neural Network (DNN), to address these issues. Inspired by theories of how neurons in the brain work, DNNs have become widely used in computer vision, NLP, and other fields. In 1943 [11], Warren McCulloch and Walter Pitts proposed the Artificial Neural Network (ANN) concept and the mathematical model of artificial neurons, which is considered the foundation for the theory of neurons in biology and physiology. A milestone in ANN research was the invention of the backpropagation (BP) algorithm [12]. The ANN is close to the human brain in structure, principle, and function. It can adapt to its environment, summarize rules, and perform operations such as identification or process control. It was not until 2012 that the ANN became popular, owing to the success of deep convolutional neural networks in image classification [13], [14]. Although deep learning methods can efficiently handle huge amounts of data, their efficiency usually decreases as the model's complexity and scope increase. Additionally, for large-scale datasets, many systems do not consider the unique labels of the datasets. This work focuses on five categories of deep learning techniques for AIA: CNN-based AIA, RNN-based AIA, DNN-based AIA, Long Short-Term Memory (LSTM)-based AIA, and Stacked Auto-Encoder (SAE)-based AIA. Furthermore, these AIA methods are analyzed and compared based on their basic concepts, main contribution, annotation accuracy, and computational complexity.
The remainder of this work is organized as follows: Section 2 discusses image segmentation and feature extraction, while Section 3 describes various deep learning-based AIA techniques. Section 4 presents the summary and conclusion of the review.

II. DEEP LEARNING FOR IMAGE ANNOTATION
Deep learning is responsible for the dramatic advancements in state-of-the-art AI-based research such as speech recognition, entity recognition, and machine translation. Deep learning can be used to solve numerous complicated AI tasks owing to its deep architecture [15]. Consequently, deep learning is currently being extended to several modern tasks and domains, in addition to conventional tasks such as face recognition, language modeling, or object detection. For instance, the study in [16] reported the use of a recurrent neural network (RNN) to de-noise speech signals. Similarly, the discovery of gene expression and clustering patterns using SAE has been reported [17]. Another study [17] generated images with different styles using a neural model, while [18] relied on deep learning to perform simultaneous sentiment analysis from several modalities. This era will experience a boost in deep learning-related studies. Empirical results suggest that deep learning performs better than other machine learning methods. Some have proposed that this is because deep learning can roughly imitate the functions of the brain (numerous layers of neural networks stacked one after another, as in the classical model of the brain). Nevertheless, there is no rigorous theoretical framework for deep learning [5]. Usually, deep learning technologies outperform conventional ML implementations because the feature extraction part is also trained. With deep learning methods, feature hierarchies are learned so that higher-level features are formed by composing lower-level features. The automatic learning of features at multiple levels of abstraction allows a system to learn complex functions; it helps the system map the input directly to the output without depending on human-crafted features [5].
For instance, in image recognition, the conventional approach is to extract handcrafted features and feed them to an SVM. In deep learning schemes, the extracted features are also optimized, and this is why deep learning methods perform better. Figure 1 describes the reasoning behind the use of deep learning. Deep learning differs from traditional ML in how its performance changes as the volume of data increases. With small data, deep learning performs poorly because it requires a large amount of data to learn well. In contrast, traditional ML algorithms work better with small data owing to their handcrafted rules. This fact is summarized in the image below [18].

III. FEATURE EXTRACTION
This section discusses feature extraction (FE), an important step in an AIA model that converts raw images into features. There are two categories of image features: low-level and high-level image features. The low-level image features, such as shape, color, and texture, are extracted via image processing, while the high-level features represent the words or concepts associated with an image. Furthermore, the image features used in the existing AIA techniques can also be classified into region and global image features; the region-based features require image segmentation, while global image features are calculated from the whole image. There are two representations of global image features: gist and color histograms [19]. Three color spaces (RGB, LAB, and HSV) are involved in the extraction of color histograms, and these color spaces are the most widely used in computer vision. The local features can capture more of the semantic image content than the global features.
The scale-invariant feature transform (SIFT) and a robust descriptor are normally used as the two local features. Both features were considered in this study to ensure appropriate image representation. The two categories of image features are described as follows:
A. Low-level image features: These are combinations of features or independent elements in an image [9]. They provide a specific description of the image's components, such as the background, color, texture, or shape, by concentrating on the basic micro details of the image [15].
B. High-level features: These features are important image representation attributes, as they represent the image from a global perspective and refer to the concept or definition of an image [9], [20]. These features can mimic the human perceptual system efficiently.
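As an illustration of the global color histogram features mentioned in this section, the following is a minimal sketch that uses only the Python standard library. The function name `color_histogram`, the bin count, and the toy pixel list are assumptions made for illustration and are not taken from any of the reviewed methods; only the RGB and HSV color spaces are covered, since the standard library has no LAB conversion.

```python
import colorsys

def color_histogram(pixels, bins=4, space="rgb"):
    """Return a normalized joint color histogram of an image.

    pixels: iterable of (r, g, b) tuples with channels in 0..255.
    bins:   number of bins per channel (illustrative default).
    space:  "rgb" or "hsv".
    """
    hist = [0.0] * (bins ** 3)
    count = 0
    for r, g, b in pixels:
        c = (r / 255.0, g / 255.0, b / 255.0)
        if space == "hsv":
            c = colorsys.rgb_to_hsv(*c)
        # Map each channel value in [0, 1] to a bin index in 0..bins-1.
        idx = [min(int(v * bins), bins - 1) for v in c]
        hist[(idx[0] * bins + idx[1]) * bins + idx[2]] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist

# Toy "image": two reddish pixels, one blue, one green.
image = [(255, 0, 0), (250, 5, 5), (0, 0, 255), (0, 255, 0)]
rgb_hist = color_histogram(image, bins=2, space="rgb")
hsv_hist = color_histogram(image, bins=2, space="hsv")
```

Because the histogram is computed over every pixel without segmentation, it is a global feature in the sense used above; a region-based variant would call the same function once per segmented region.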

IV. IMAGE SEGMENTATION
In most studies, segmentation methods that depend mainly on the color space of an image are used. These methods are mainly used to extract visual features efficiently, either locally or globally, via image segmentation. In the global methods, a single set of features is computed from the entire image. In contrast, the local methods partition the image into blocks or regions before computing a set of features for each block. Thus, images can be represented with features at the object level while still providing spatial information. However, the unsupervised segmentation associated with region features may affect their accuracy, since segmentation performance normally depends on the application. Popular image segmentation algorithms include grid-based, clustering-based, contour-based, region growing-based, and statistical model-based techniques [6]. The between-class variance maximization method is one of the efficient image segmentation methods because, in grey-level images, it selects a global threshold value by maximizing the separability of the classes [24].
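The global thresholding idea described above can be sketched as follows. This is a minimal illustration of choosing a grey-level threshold by maximizing the between-class variance (the criterion commonly known as Otsu's method); it is not necessarily the exact procedure of [24], and the function name and toy pixel values are assumptions.

```python
def otsu_threshold(gray_values, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0       # number of pixels in the class <= t
    sum0 = 0.0   # intensity sum of the class <= t
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor 1 / total**2.
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

# Toy grey-level image with a dark and a bright group of pixels.
pixels = [10, 12, 11, 13, 200, 205, 198, 202]
t = otsu_threshold(pixels)
mask = [1 if v > t else 0 for v in pixels]
```

The resulting binary mask separates the two groups, and region features could then be computed separately for each side of the threshold.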

V. AUTOMATIC IMAGE ANNOTATION METHODS USING DEEP LEARNING
In this section, a brief review of the deep learning methods for AIA is conducted. These methods are classified into five categories: CNN-based, RNN-based, DNN-based, LSTM-based, and SAE-based AIA.

A. CONVOLUTION NEURAL NETWORK (CNN)
CNN or ConvNet represents a class of deep feed-forward ANNs that are mostly used in visual image analysis. It is a well-known DL architecture inspired by the natural visual perception mechanism of living things. Several CNN architecture variants exist in the literature, such as LeNet-5, AlexNet, VGG, GoogLeNet, and Deep Residual Learning [16]. These variants are, however, similar in their basic components. For instance, the famous LeNet-5 comprises three basic types of layers (convolutional, pooling, and fully connected), as depicted in Figure 2. The convolutional layer learns feature representations of the input. This layer consists of several convolution kernels that are used to compute different feature maps. Each neuron of a feature map is connected to a region of neighbouring neurons in the preceding layer (a region known as the neuron's receptive field in the preceding layer). The input is first convolved with a learned kernel, and an element-wise nonlinear activation function is then applied to the convolved results to obtain the new feature map. It should be noted that, to generate each feature map, the same kernel is shared by all the spatial locations of the input. Different kernels are required to obtain the complete set of feature maps [15]. As presented in Figure 3, a max pooling layer is commonly inserted periodically between successive convolutional layers in a ConvNet architecture. This layer reduces the amount of computation and the number of parameters in the network by progressively reducing the spatial size of the representation, thereby also controlling overfitting. The pooling layer operates independently on every depth slice of the input and resizes it spatially using the MAX operation.
The most common form is a pooling layer with 2 × 2 filters applied with a stride of 2, which downsamples every depth slice of the input by two along both width and height, discarding 75% of the activations [25].
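The convolution, activation, and pooling steps described above can be illustrated with a minimal sketch in plain Python. The 5 × 5 input, the 2 × 2 kernel, and the choice of ReLU as the nonlinearity are assumptions for illustration only; a trained network would learn its kernels and typically use many of them per layer.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(feature_map):
    """Element-wise nonlinear activation."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2: keeps 1 of every 4 activations."""
    h = len(feature_map) // 2 * 2
    w = len(feature_map[0]) // 2 * 2
    return [[max(feature_map[i][j], feature_map[i][j + 1],
                 feature_map[i + 1][j], feature_map[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 2],
         [2, 1, 0, 1, 0],
         [1, 3, 1, 0, 1]]
kernel = [[1, 0],
          [0, -1]]   # illustrative kernel, not a trained one

fmap = relu(conv2d_valid(image, kernel))   # 4 x 4 feature map
pooled = max_pool_2x2(fmap)                # 2 x 2 map after pooling
```

Here the 4 × 4 feature map holds 16 activations and the pooled map holds 4, which matches the statement that 2 × 2 pooling with a stride of 2 discards 75% of the activations.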
The performance of numerous computer vision tasks has been improved by CNNs owing to their modeling capacity when learning from a huge volume of supervised data. Numerous models have been suggested for CNN-based AIA and retrieval; some of these models are discussed as follows. The combination of the CNN features of an image with a semantic extension model (SEM), using the well-known AlexNet CNN model, was proposed by Ma et al. [4]. The study extracted the CNN features by stripping the network's final layer; this proved to be a useful approach for the SEM. The performance of the SEM was evaluated on Corel5k [5], ESP Game [26], and IAPR TC-12 [8], which are three publicly available standard image annotation datasets. However, the precision of this model suffers from the non-uniform tag distribution in large datasets: the model enlarges the set of predicted tags when searching for the neighbor group of an image, which decreases the precision. Another study by Wang et al. [5] modified the CaffeNet CNN model to build a large-scale model called MVAIACNN for image annotation. In the proposed MVAIACNN, the layers are shallow, and each category is directly regarded as a label; the model takes raw images as inputs and relies on the CNN to perform large-scale image annotation. The performance of the model was evaluated on the MIRFlickr25K and NUS-WIDE datasets. To address the fixed number of labels that appears when labels are assigned according to a ranking function during multilabel image annotation, the application of CNN-THOP for image annotation was proposed in [31]. First, a CNN model is used to predict the probability for each class of labels. Owing to the merits of the VGG16 network architecture, the CNN structure in that study was improved based on VGG16.
A batch normalization (BN) layer added within the CNN significantly accelerates convergence, and the network structure and parameters are adjusted to make them more suitable for the datasets. In another approach, Luo et al. [27] suggested a novel CNN-based technique for annotating images of power grid objects. The images' attribute list is first obtained before building the image database for power grid objects. This database comprises a huge number of images in multilabel networks. Then, each image is annotated using an attribute-specific segmentation model. The accuracy of the proposed method was evaluated and found to be 94.83%. Lin et al. [28] focused on a feature combination technique for image annotation and retrieval. This method uses low-level color features of the original images together with features learned by CNNs. The proposed technique proceeds as follows: i) extract the color features from the original images, build a visual lexicon, and use a bigram to represent co-occurrence relations, ii) feed the images into the CNN through five convolutional layers and let the pooling layer produce a visual feature, and iii) combine the two features at the first hidden layer of the DNN system. Mahmood et al. [29] suggested a DL and computer vision-based framework for the automatic annotation of unlabelled coral images. The proposed framework depends on a novel coral classification method that exploits the robust image representation capability of CNNs. Owing to the lack of ground-truth labels for numerous coral reef images, a human expert is incorporated in the loop to validate the new method's accuracy. The trained coral classifier was used to analyze the coral reefs of the Abrolhos Islands, which are considered one of the unique marine areas in Western Australia. The unlabelled coral mosaics of 3 sites of this coral reef (covering two-year new method's accuracy increase in the method's performance was observed when Abrolhos Islands' performing all years.
This indicates the challenges faced during the training phase; the test set images were collected over several years, which could affect the performance. This is mainly caused by the time-related changes occurring in the coral reefs. However, the major contributors to misclassification are the uncertain boundaries between corals and non-corals. A study by Sarangi et al. [30] suggested a technique that performs image annotation using convex DL models such as Tensor Deep Stacking Networks (T-DSN) and Kernel Deep Convex Network (DCN). The study also proposed the use of features extracted with DCN as the input to the convex models. The convex models with DCN-extracted features as input provided the best performance. The extraction of the features becomes easier after training the convolutional network on a large image set. The convex networks' training time is short, which makes them well suited to image annotation tasks. Considering the K-CDN and T-DSN models, it was observed that it is not beneficial to have different numbers of nodes in each hidden layer. In this approach, no criterion for selecting a proper number of nodes was given; however, it would be useful to determine the appropriate number of nodes and global optimum parameters for T-DSN and DCN, respectively. Simple and effective image annotation models that depend on CNN-extracted image features and word embedding vectors to capture their associated labels were presented by Venkatesh et al. [11]. Exploiting the versatility of the mid-level visual features of deep learning models, a method of extracting mid-level convolutional features was developed and studied. Based on this, an image annotation method based on positive examples was proposed. With the mid-level convolutional feature extraction method used in that paper [37], the deep feature data volume and hardware costs are reduced.
The first set of models that jointly model the visual and textual features of the data was based on the Canonical Correlation Analysis (CCA) framework. In the CNN-based model, the last layer of CaffeNet was replaced with a projection layer (for regression tasks); the resulting network was then trained to map images semantically onto the word embedding vectors of their tags. There are two advantages of this type of modeling: i) numerous handcrafted features are not required, so metric learning, or how to combine those features effectively, is not an issue; ii) it is a relatively easier approach to formulate than other discriminative or generative models. Additionally, using CNN-extracted features in the earlier models improves their effectiveness. An AIA approach that depends on DL models has been presented by Kashani et al. [32]. First, the approach uses CNN models to extract features from the input image. The extracted feature vectors are compared against those of all the training images, and the most relevant tags are assigned to the input image. This approach is a search-based annotation method that leverages deep architectures (i.e., CNNs) as feature extractors. In this way, the proposed approach takes advantage of the search-based method and deep models simultaneously. CNN features are extracted from images using pre-trained models such as Caffe, VGG-16, VGG-19, and ResNet networks. Zhang et al. [10] presented a DL-based feature extraction technique for annotating skin biopsy histopathological images. They also used a CNN as the feature training model. The model depended on both nonlinear transformations of the original features and multiple-layer weighted combinations to learn the abstract features. To generate the annotation results, a supervised MIML learning model was placed on top of the DL model. However, some problems remain to be solved; the first one is that the model can only perform region-based supervised learning.
The uncertainty at the multiple-instance level makes it impossible to propagate the model output loss through the network, so a supervised fine-tuning of the network weights cannot be conducted. The second problem relates to the design of a CNN for multiple-instance data samples. The question is whether it is possible to build a CNN model that can encode a multiple-instance sample rather than just a single instance. Rajchl et al. [34] suggested combining an NN model with an iterative graphical optimization technique. This combination aims to develop a model that can recover pixel-wise object segmentations from an image database with associated bounding box annotations. This concept was based on the popular GrabCut [35] approach, which iteratively fits an intensity appearance model to a segment and then regularizes it to achieve segmentation. Similarly, the suggested DeepCut approach iteratively updates the training targets using a CNN model and regularizes the segmentation using a fully connected conditional random field (CRF). A generic form of this approach was formulated; thus, it can be easily applied to any image or object modality [40]. Another article proposed an end-to-end deep learning framework for multi-label annotation of RS images that exploits dual-level semantic concepts. The framework includes a shared CNN for visual feature learning, a classification branch for multilabel annotation, and an embedding branch for maintaining the similarity relationships among the triplet images grouped by scene-level semantic concepts. An attention mechanism is introduced in the classification branch for salient object detection, while a skip connection is incorporated to combine information from multiple layers. The proposed method's main drawback is that it fails to consider the label dependences at the object level and the label relationships between the scene and the object level.
Thus, the authors plan to adopt the RNN to model the label relationships between the intralevel and interlevel semantic concepts of RS images in future work.
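The search-based annotation idea reviewed in this subsection (compare the CNN feature vector of an input image against those of the training images and transfer the most relevant tags) can be sketched as follows. This is a minimal illustration and not the implementation of [32]: the cosine similarity, the majority vote over the k nearest neighbours, the function names, and the three-dimensional toy feature vectors are all assumptions; in practice the features would come from a pre-trained CNN.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def annotate(query_feature, train_features, train_tags, k=2, n_tags=2):
    """Transfer the most frequent tags of the k most similar training images."""
    ranked = sorted(range(len(train_features)),
                    key=lambda i: cosine(query_feature, train_features[i]),
                    reverse=True)
    votes = Counter()
    for i in ranked[:k]:
        votes.update(train_tags[i])
    return [tag for tag, _ in votes.most_common(n_tags)]

# Toy stand-ins for CNN feature vectors and their ground-truth tags.
train_features = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]
train_tags = [["sky", "sea"], ["sky", "beach"], ["forest", "tree"]]

tags = annotate([1.0, 0.1, 0.0], train_features, train_tags)
```

In this toy example the two nearest neighbours both carry the tag "sky", so it receives the most votes and is ranked first among the predicted tags.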

B. RECURRENT NEURAL NETWORK (RNN)
The RNN, also called the Elman network, has three layers in each time frame [36]. The design of the RNN is illustrated in Figure 4. The three layers in each time frame are the input word layer w, the recurrent layer r, and the output layer y, and the activations of these layers at time t are represented as w(t), r(t), and y(t), respectively. w(t) is the current word vector, which can take the form of a simple 1-of-N coding representation h(t) (i.e., the one-hot representation; it is binary, and its dimension is the same as the dictionary size, with just one non-zero component). The calculation of y(t) can proceed thus (Mikolov et al. 2010): x(t) = [w(t), r(t-1)]; r(t) = f1(U · x(t)); y(t) = g1(V · r(t)). Here, x(t) represents the concatenation of the vectors w(t) and r(t-1), f1(·) is the element-wise sigmoid function, and g1(·) is the softmax function. U and V are the learnable weights. A simple RNN model is shown in Figure 4. In the RNN, the size of the network is a function of the input sequence length. The recurrent layers connect the sub-networks at different time frames. Hence, when executing backpropagation, the error must be propagated back through time via the recurrent connections [12]. The study by Tsochatzidis et al. [20] presented a method for facilitating the annotation of object instances. In the approach, a polygon that outlines an object is predicted using Polygon-RNN, while corrections from an annotator in the loop are easily incorporated. The system's analysis showed that it achieved an annotation speed-up of up to a factor of 4.74, and its annotation agreement was the same as that of human annotators. This approach is advantageous because it offers structurally plausible object annotations and allows the annotator to reach a predefined annotation accuracy in just a few clicks.
A multimodal RNN (m-RNN) framework was presented by Mao et al. [39]. The framework works efficiently for three tasks: sentence generation, sentence retrieval based on a query image, and image retrieval based on a query sentence. The model components include a deep RNN and a deep CNN, which interact in a multimodal layer. The proposed m-RNN is strong in connecting sentences and images; more sophisticated language models and complex image representations could also be easily introduced into the system. The study by Joonmyun et al. [22] presented an approach for the automatic extraction of theme-related keywords from users' natural language comments on their media files. Here, 'theme' refers to the concepts that describe the content of media files, such as natural sites, pets, places, and palaces. In this approach, an RNN was employed, since RNNs are good at implicit pattern recognition in sequential data. A semantically regularized CNN-RNN model was proposed by Feng et al. [23] for image annotation. With this semantic regularization, the CNN-RNN interface becomes semantically meaningful. It distributes the label correlation and label prediction tasks between the two models and makes the training of the full model more efficient and stable. The approach was evaluated on NUS-WIDE and MSCOCO, and it showed efficacy on both image captioning and multilabel classification. A model for generating natural language descriptions of image regions was proposed by Karpathy et al. [41]. This model is based on weak labels in image and sentence datasets and makes few hardcoded assumptions. The approach includes a model for aligning parts of the language and visual modalities. This model was shown to provide good image-sentence ranking performance. Also described was an m-RNN architecture that generates descriptions of visual data, whose performance was evaluated in both region-level and full-frame experiments.
In both cases, the m-RNN performed better than other retrieval models. Most of the prevailing CNN-RNN-based techniques are prone to misprediction and missing objects due to their dependence on a global representation at the image level. Hence, Linghui et al. [9] addressed this issue by proposing the global-local attention (GLA) method for image captioning. The GLA method selectively focuses on the semantically important image regions while maintaining the global context information, using attention mechanisms to combine the local features (at the object level) and the global features (at the image level). When evaluated on the Microsoft COCO caption dataset, the proposed GLA showed good performance under different evaluation metrics. The RNN has two advantages in AIA: one is its ability to generate outputs of varying lengths, and the other is that it can predict the output at the current time step by recalling the previous inputs. However, the disadvantages of RNN-based image annotation methods are the inconsistency between the recall (R) and precision (P) values and the imprecise mapping from the estimated semantic class of a word to its precise meaning in the text, which is due to noise in the attributes used to sequence the data.
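The simple recurrent computation described at the start of this subsection (concatenate the one-hot word vector with the previous recurrent state, apply a sigmoid layer, then a softmax output layer) can be sketched as below. The vocabulary size, hidden size, weight values, and word sequence are arbitrary assumptions for illustration; the weights are not trained.

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def softmax(v):
    m = max(v)                      # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def rnn_step(w_t, r_prev, U, V):
    """One time frame: x(t) = [w(t), r(t-1)], r(t) = f1(U x(t)), y(t) = g1(V r(t))."""
    x_t = w_t + r_prev              # list concatenation of the two vectors
    r_t = sigmoid(matvec(U, x_t))
    y_t = softmax(matvec(V, r_t))
    return r_t, y_t

vocab, hidden = 3, 2
U = [[0.1, -0.2, 0.3, 0.0, 0.5],
     [0.4, 0.1, -0.3, 0.2, -0.1]]   # hidden x (vocab + hidden), illustrative values
V = [[0.2, -0.1],
     [0.0, 0.3],
     [-0.4, 0.1]]                   # vocab x hidden, illustrative values

r = [0.0] * hidden
for word_id in [0, 2, 1]:           # a toy sequence of word indices
    w = [1.0 if i == word_id else 0.0 for i in range(vocab)]   # one-hot coding
    r, y = rnn_step(w, r, U, V)
```

After each step, `y` is a probability distribution over the vocabulary, and `r` carries information from the earlier words into the next time frame.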

C. DEEP NEURAL NETWORK (DNN)-BASED AIA
A DNN is a neural network with a certain level of complexity; it is a neural network consisting of more than two layers. DNNs depend on sophisticated mathematical modeling to process data in complex ways. Chengjian et al. [42] developed a new multimodal deep learning framework that learns intermediate representations and provides a good network initialization. Then, the distance metric functions on each modality were optimized using backpropagation; finally, the combinational weights of the different modalities were optimized by applying the exponentiated gradient online learning algorithm. Further deep learning research that addresses the number of feature dimensions needed to achieve satisfactory system performance for a given neural network framework is necessary. Another aspect to consider is how to improve a given deep learning architecture and its robustness. Yang et al. [43] proposed a new multi-view stacked auto-encoder (MVSAE) model that jointly establishes the correlations between high-level semantic keywords and low-level image features for automatic image annotation. First, the SAE was modified by using an iteration algorithm and a sigmoid function predictor. Then, the problem of image keywords with an imbalanced distribution was addressed. The influence of the imbalance learning method varies across levels of keyword frequency. The F1 score decreases slightly for high-frequency keywords because the emphasis on low-frequency keywords causes some high-frequency keywords to be misclassified. In contrast, the low-frequency keywords show better performance than with the original SAE. Experiments on three popular datasets demonstrated the proposed framework's effectiveness in achieving a favorable image annotation performance.
The DNNs with multiple nonlinear hidden layers can learn complex input-output relationships; however, the network can be exposed to local optima and convergence difficulty due to the nonlinear mapping between the outputs and the inputs when using the BP algorithm [36].
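The comparison of F1 scores across high- and low-frequency keywords mentioned above relies on per-keyword precision, recall, and F1. The following is a minimal sketch of how these scores can be computed from ground-truth and predicted tag sets; the function name and the toy tag sets are assumptions for illustration and do not reproduce any reported result.

```python
def per_label_prf(true_sets, pred_sets, labels):
    """Return {label: (precision, recall, F1)} over a collection of images."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(true_sets, pred_sets)
                 if label in t and label in p)
        n_pred = sum(1 for p in pred_sets if label in p)   # times predicted
        n_true = sum(1 for t in true_sets if label in t)   # times in ground truth
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true if n_true else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores

# Toy data: "sky" is a high-frequency keyword, "tiger" a low-frequency one.
true_sets = [{"sky", "sea"}, {"sky"}, {"tiger"}, {"sky", "tree"}]
pred_sets = [{"sky", "sea"}, {"sky", "sea"}, {"sky"}, {"sky"}]

scores = per_label_prf(true_sets, pred_sets, ["sky", "sea", "tiger"])
```

In this toy example the frequent keyword "sky" is over-predicted (precision 0.75, recall 1.0), while the rare keyword "tiger" is never predicted and scores zero, which is the kind of imbalance effect that balanced training strategies aim to reduce.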

D. LONG-SHORT-TERM MEMORY (LSTM) BASED ON AIA
When modeling temporal dynamics in sequences, RNN is a good choice even though traditional RNN finds it difficult to study long-term dynamics due to the issue of exploding and vanishing gradients to address these issues, the LSTM network was proposed [14] and the core of its architecture (refer to Figure 5) is a memory cell for storing the state of the cells over time; there is also the gates for controlling how and when the states of the cells can be updated. Between the gates and the memory cells exists several variants with different connections. Word generation in the LSTM model depends mainly on the word's embedment at the current time-step and on the preceding hidden state (this includes the pre-information of the image) [50]. This process is systematically sustained until the end token of a sentence is encountered. Meanwhile, the progression of this procedure weakens the relevance of the image data served at the beginning. The generated words at the start of a sequence are also prone to the same issue; hence, the generation may be performed almost ''blindly'' for a long sentence towards the end of the sentence. Despite the capability of LSTM to maintain long-term memory to a certain level, it is still a problem for sentence generation [4,1]. The study by Jia et al. [28] suggested a modification to the LSTM model by introducing semantic data as additional feedback to each LSTM block unit. With this modification, the model can better describe an image's content without diving into the common but unrelated phrases. This modification also made it possible to search various length regularization types for beam exploration, thereby preventing short sentences' preference and making the results better. Another effort by Zhang et al. [45] presented an automatic natural language description generation model for videos. This model is dependent on an LSTM based sentence generator and a 2-stream video representation-learning model. 
It also introduces a novel parallel video representation that merges motion boundary history frames with RGB frames; the two carry complementary information about temporal motion and visual appearance. Notably, the suggested framework can effectively learn to combine several feature streams simultaneously and supports end-to-end training of the whole model. The model was compared with existing video description models on three different datasets, and the results showed that it outperformed them. Sarangi et al. [30] presented a generative AIA model that exploits recent improvements on both fronts. It uses a deep CNN for image region detection.
The experiments showed that the model achieved better training and accuracy when the image representation from a detection model was coupled with the embedding of the input word; it was also observed that a good portion of the information contained in the last layer of the detection model was lost when it was fed to the LSTM decoder as a vector. A possible explanation is that the last fully connected layer of YOLO models encodes only the class probabilities and the bounding boxes of the detected objects, and this information is not sufficiently rich.
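The gated memory cell described at the start of this subsection can be illustrated with a minimal sketch of one LSTM time step. It uses the standard formulation with input, forget, and output gates, not any specific variant from the reviewed studies. The names `lstm_step` and `init_params`, the dimensions, and the random weights are assumptions made only for illustration; `x_t` plays the role of the current word embedding and `h_prev` the preceding hidden state.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    # Matrix-vector product plus bias, written with plain lists.
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def init_params(input_dim, hidden_dim, seed=0):
    # Hypothetical small random weights; a real model would learn these.
    rng = random.Random(seed)
    params = {}
    for gate in "ifog":
        params["W" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
            for _ in range(hidden_dim)
        ]
        params["b" + gate] = [0.0] * hidden_dim
    return params


def lstm_step(x_t, h_prev, c_prev, params):
    v = x_t + h_prev  # concatenate current input and previous hidden state
    i = [sigmoid(a) for a in affine(params["Wi"], v, params["bi"])]    # input gate
    f = [sigmoid(a) for a in affine(params["Wf"], v, params["bf"])]    # forget gate
    o = [sigmoid(a) for a in affine(params["Wo"], v, params["bo"])]    # output gate
    g = [math.tanh(a) for a in affine(params["Wg"], v, params["bg"])]  # candidate state
    # The memory cell keeps part of the old state and adds gated new content.
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # The hidden state is a gated view of the memory cell.
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t


input_dim, hidden_dim = 6, 4
params = init_params(input_dim, hidden_dim)
rng = random.Random(1)
h = [0.0] * hidden_dim
c = [0.0] * hidden_dim
for _ in range(5):  # five time steps of a toy input sequence
    x_t = [rng.uniform(-1.0, 1.0) for _ in range(input_dim)]
    h, c = lstm_step(x_t, h, c, params)
```

Because the image information enters only through the initial state in such a decoder, its contribution to `c` is repeatedly scaled by the forget gate, which is one way to see why its influence fades over long sentences.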

E. STACKED AUTO-ENCODER (SAE) BASED ON AIA
As a form of unsupervised learning structure, an auto-encoder comprises three layers: input, hidden, and output (refer to Figure 6). An auto-encoder has two parts that are trained together: the encoder and the decoder. The encoder transforms the input data into a hidden representation, while the decoder reconstructs the input data from the hidden representation. The SAE is an NN with several auto-encoder layers and is commonly used as a DL method for dimensionality reduction [18] and feature learning [17], [22], [4], [19].
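The encoder and decoder roles, and the way auto-encoders are stacked, can be sketched as follows. This is a minimal, untrained illustration under stated assumptions: the class `AutoEncoder`, the helper `stacked_encode`, the layer sizes (8, 4, 2), and the random weights are all hypothetical, and the sigmoid activation is one common choice rather than the one used in any particular study.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(n_in, n_out, rng):
    # Hypothetical random initialization; training would adjust these values.
    return {
        "W": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
        "b": [0.0] * n_out,
    }


def forward(layer, v):
    return [
        sigmoid(sum(w * x for w, x in zip(row, v)) + b)
        for row, b in zip(layer["W"], layer["b"])
    ]


class AutoEncoder:
    def __init__(self, n_in, n_hidden, rng):
        self.enc = make_layer(n_in, n_hidden, rng)  # input -> hidden
        self.dec = make_layer(n_hidden, n_in, rng)  # hidden -> reconstructed input

    def encode(self, x):
        return forward(self.enc, x)

    def decode(self, h):
        return forward(self.dec, h)

    def reconstruction_error(self, x):
        # Mean squared error between the input and its reconstruction;
        # this is the quantity that unsupervised training would minimize.
        x_hat = self.decode(self.encode(x))
        return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def stacked_encode(autoencoders, x):
    # In an SAE, the hidden representation of one auto-encoder
    # becomes the input of the next.
    for ae in autoencoders:
        x = ae.encode(x)
    return x


rng = random.Random(0)
stack = [AutoEncoder(8, 4, rng), AutoEncoder(4, 2, rng)]
x = [rng.random() for _ in range(8)]
code = stacked_encode(stack, x)           # 8-dimensional input reduced to 2 values
error = stack[0].reconstruction_error(x)  # reconstruction quality of the first layer
```

The shrinking layer sizes show why an SAE is a natural fit for dimensionality reduction: each encoder compresses the representation produced by the previous one.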
Luo et al. [27] presented an effective multi-modal retrieval system based on SAE. This system maps the high-dimensional features extracted from different types of media data into a single low-dimensional space for metric learning. When a DNN is used to solve image annotation problems, image features normally serve as the model input, while keywords serve as the model targets. However, modeling the complex relationship between tags and features requires several hidden layers. Once trained, the DNN can generate appropriate keywords for new images. Because the performance of a DNN depends on its initial parameters, its optimization becomes a challenge. Zhou et al. [51] suggested two strategies for addressing data imbalance in image annotation. The study proposed a robust, balanced SAE algorithm (RB-SAE) that strengthens the training of low-frequency tags to improve their annotation. The model also increases annotation stability through group-enhanced training of sub-B-SAE models. This approach enables the model to address the issue of data imbalance. Regarding the annotation process, a native symmetry dataset was hypothesized for the unknown image, taking that image as the starting point. At the same time, high- and low-frequency image attributes were distinguished so that a different annotation process could be applied to each. The low-frequency images were annotated using the local semantic propagation algorithm (SP), while the high-frequency images were annotated using the RB-SAE algorithm. Together, these form the attribute discrimination annotation (ADA) framework, which improves the overall image annotation effect. SAE is one of the commonly used DL techniques; it has received much attention in fault diagnosis and has been studied as a common form of DNN [17]. The study by Jia et al.
[7] suggested a DNN-based SAE to diagnose faults in planetary gearboxes and roller bearings; the input for this model was the frequency spectrum obtained through the Fourier transform. Guo et al. [49] used multidimensional statistical attributes of raw vibration signals as input to the SAE, which can be considered a form of feature combination. Liu et al. [8] used normalized spectrograms created by the short-time Fourier transform (STFT) as input to SAEs to diagnose faults in rolling bearings. Another study [42] presented a multi-view SAE (MVSAE) model with a sigmoid predictor for annotating images. Here, image features are used as inputs to the model, while keywords are the model targets. However, modeling the complex relationships between features and tags requires several hidden layers. Since the performance of a DNN depends on its initial parameters, pre-trained parameters were adopted to optimize the MVSAE model. Specifically, the SAE was first trained using the chromatic feature $I$ as the input $x$ to produce the preliminary probability distribution $D_1$ of the keywords. Then, the SAE was retrained using $I$ and $D_1$ as the new input $x$ to generate the final probability distribution $D_2$ of the keywords. Finally, the image keywords $\hat{T}$ are derived from $D_2$.
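The data flow of the two-stage procedure described above (features to a preliminary distribution, then features plus that distribution to a final distribution, then keywords) can be sketched as follows. This is not the MVSAE model of [42]: the class `SigmoidPredictor` is a hypothetical stand-in consisting of a single sigmoid layer trained by stochastic gradient descent, and the toy features, the two-word vocabulary, the learning rate, and the 0.5 decision threshold are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class SigmoidPredictor:
    """Hypothetical stand-in for the SAE with a sigmoid predictor:
    one sigmoid unit per keyword, trained with cross-entropy gradients."""

    def __init__(self, n_inputs, n_labels):
        self.W = [[0.0] * n_inputs for _ in range(n_labels)]
        self.b = [0.0] * n_labels

    def predict(self, x):
        return [
            sigmoid(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(self.W, self.b)
        ]

    def fit(self, X, Y, lr=0.5, epochs=500):
        for _ in range(epochs):
            for x, y in zip(X, Y):
                p = self.predict(x)
                for k in range(len(self.b)):
                    err = p[k] - y[k]  # gradient of cross-entropy w.r.t. the logit
                    self.W[k] = [w - lr * err * v for w, v in zip(self.W[k], x)]
                    self.b[k] -= lr * err
        return self


vocabulary = ["sky", "sea"]                        # assumed toy keyword set
I = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]   # assumed toy image features
T = [[1, 0], [0, 1], [1, 1], [0, 0]]               # ground-truth keyword indicators

# Stage 1: train on the features I to obtain the preliminary distribution D1.
stage1 = SigmoidPredictor(3, 2).fit(I, T)
D1 = [stage1.predict(x) for x in I]

# Stage 2: retrain on I concatenated with D1 to obtain the final distribution D2.
X2 = [x + d for x, d in zip(I, D1)]
stage2 = SigmoidPredictor(5, 2).fit(X2, T)
D2 = [stage2.predict(x) for x in X2]

# Derive the keywords from D2 (the threshold of 0.5 is an assumption).
T_hat = [[word for word, p in zip(vocabulary, d) if p >= 0.5] for d in D2]
```

The point of the sketch is the second-stage input `X2`: the preliminary keyword probabilities become extra features, so the final predictor can take the first-stage keyword estimates into account.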

VI. SUMMARY
The five types of DL-based AIA methods were discussed in the previous sections in terms of their concepts, models, algorithms, and associated problems. The advantages of DL-based AIA methods include the ability to exploit massive data, learn complex relationships, and generate strong features without manual feature selection. Other advantages include the derivation of sufficient side information and the calculation of a variable number of labels. However, the problems associated with DL-based AIA methods include local optimum entrapment, the need for numerous training images, and the inability to control the training process. In conclusion, DL brings both opportunities and challenges for AIA. On the one hand, advances in DL have brought large improvements in AIA performance on image datasets; on the other hand, certain challenges remain. Among these is the declining efficiency of DL-based AIA methods as the depth and breadth of DNNs increase [51]. Despite their capability to learn complex input-output relationships, DNNs still suffer from local optimum entrapment and may fail to converge when trained with the back-propagation (BP) algorithm.
Finally, although a combined RNN-CNN network can handle label dependencies and label quantity prediction for large-scale image annotation, label ordering still needs to be improved, because an RNN requires an ordered sequential list as input. The major challenge of the AIA techniques based on deep learning is related to high-dimensional feature analysis: none of the existing features describes images sufficiently. Another problem is that existing methods perform annotation and ranking simultaneously, which is not ideal for efficient image retrieval. Images also need to be ranked within each category produced by a single labeling approach to achieve better retrieval accuracy; this remains an open issue.

VII. DATASETS
A. COREL-5K [41]
This dataset consists of 4500 training images and 499 testing images, with each image annotated with up to five labels (approximately 3.4 labels per image). Corel-5K is one of the oldest datasets for image annotation.
B. ESP GAME [23]
The ESP Game dataset contains 18689 training images and 2081 testing images, and each image is annotated with up to 15 labels (approximately 4.7 labels per image). The dataset was formed from an online game in which two players assign labels to a given image and score a point for each label they have in common. Because many participants are involved in the manual annotation task, the dataset is challenging.
C. IAPR TC-12 [42]
This dataset contains 17665 training images and 1962 testing images, with each image annotated with up to 23 labels (approximately 5.7 labels per image). Each image has a long description in several languages. Makadia et al. [38], [39] extracted nouns from the English descriptions and treated them as annotations. Since then, the dataset has been widely used for the evaluation of image annotation methods.

D. NUS-WIDE [43]
NUS-WIDE is the largest publicly available image annotation dataset. It comprises 269648 images downloaded from Flickr, with 81 labels in the vocabulary. Each image in the dataset is annotated with up to 3 labels (approximately 2.40 labels per image). Following earlier reports [10], [34], images without labels were discarded, leaving 209,347 images, which were split into 125000 training images and 80000 testing images using the split proposed by the authors of the dataset.
E. MS-COCO [44]
MS-COCO is the second-largest dataset after NUS-WIDE and a popular image annotation dataset. It comprises 82783 training images with 80 labels, and each image is annotated with an average of 2.9 labels. It is also used for image recognition.

VIII. EVALUATION METRICS
The performance of the different types of AIA methods is evaluated using several metrics, such as recall, precision, F1-score, and N+ [52].
• Recall and Precision: For a given keyword, let $m_1$ be the number of images in the test dataset that the model annotates with the label, $m_2$ the number of images correctly annotated with the label, and $m_3$ the number of images annotated with the label in the ground truth. The recall is then $R = m_2 / m_3$, while the precision is $P = m_2 / m_1$.
Recall represents the capability to retrieve relevant information, while precision measures the capability to reject irrelevant information. The performance of AIA models is usually evaluated using a combination of recall and precision; however, evaluating AIA models with recall and precision alone is difficult because the two metrics conflict with each other. Note that AIA methods usually force the annotation of each test image with k (generally 5) labels, even when the image has more or fewer labels in the ground truth. Thus, the values of recall and precision may be biased even when all the ground truth labels are predicted by the model [48].
• F1-score: This is calculated as $F_1 = \frac{2PR}{P + R}$. Because the performance of AIA models cannot be adequately evaluated using either recall or precision alone, the two need to be integrated into one evaluation index. Additionally, the F1-score can measure the robustness of AIA methods, where a larger F1-score suggests a more robust model.
• N+: This metric is the number of keywords correctly assigned to at least one test image, that is, the number of keywords whose recall is positive. Well-performing AIA models usually have high N+ values.

the best F1 score for all databases. Besides, on the other databases, the results did not reach the desired level compared with the rest of the algorithms. As shown in Table 1, CNN-RNN [8] achieves a precision, recall, and F1-score of 55.65%, 50.17%, and 52.77%, respectively; its performance is a significant improvement, with the best precision compared to the remaining algorithms.
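The metrics defined in this section can be illustrated with a minimal sketch. The function names, the toy predictions, and the small vocabulary are assumptions made for illustration. Averaging the per-label precision and recall over the vocabulary before computing the F1-score is a common convention and is also assumed here, as is returning 0 when a denominator is zero.

```python
def per_label_counts(predicted, ground_truth, label):
    # m1: images the model annotates with the label
    # m2: images correctly annotated with the label
    # m3: images carrying the label in the ground truth
    m1 = sum(label in p for p in predicted)
    m2 = sum(label in p and label in g for p, g in zip(predicted, ground_truth))
    m3 = sum(label in g for g in ground_truth)
    return m1, m2, m3


def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, ground_truth, vocabulary):
    precisions, recalls, n_plus = [], [], 0
    for label in vocabulary:
        m1, m2, m3 = per_label_counts(predicted, ground_truth, label)
        precisions.append(m2 / m1 if m1 else 0.0)
        recalls.append(m2 / m3 if m3 else 0.0)
        if m2 > 0:  # the label has positive recall
            n_plus += 1
    precision = sum(precisions) / len(vocabulary)
    recall = sum(recalls) / len(vocabulary)
    return precision, recall, f1_score(precision, recall), n_plus


# Toy example: every image is forced to receive k = 2 labels,
# although two of the images have a single ground-truth label.
vocabulary = ["sky", "sea", "tree", "car"]
predicted = [{"sky", "sea"}, {"sky", "tree"}, {"tree", "car"}]
ground_truth = [{"sky", "sea"}, {"sky"}, {"car"}]

precision, recall, f1, n_plus = evaluate(predicted, ground_truth, vocabulary)

# Consistency check on the CNN-RNN percentages quoted from Table 1,
# reading the first two values as precision and recall.
table_f1 = f1_score(55.65, 50.17)
```

In the toy example, the model predicts every ground-truth label, yet the forced second label "tree" is never correct, so the averaged precision stays below 1. This is the bias of forced k-label annotation noted above. The value of `table_f1` rounds to 52.77, which matches the third CNN-RNN figure and suggests that the three quoted values are precision, recall, and F1-score.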

X. PERFORMANCE COMPARISON OF ANNOTATION METHODS
This section compares the performance of some typical models, such as CNN, RNN, LSTM, and SAE, whose details are given in Table 2. The five types of AIA methods were assessed based on their underlying idea, feature extraction technique, annotation accuracy, computational complexity, and datasets. Moreover, the evaluation metrics used to assess AIA methods were reviewed, and the advantages and issues associated with each technique were explained. The major challenge of the AIA techniques based on deep learning is related to high-dimensional feature analysis. None of the existing features describes images sufficiently, and no existing feature is expressive enough to represent the high variation between images efficiently. The challenge for AIA techniques is to reduce the semantic gap between the low-level visual features captured by machines and the high-level semantic concepts perceived by humans. Many studies have been conducted on mining image-image, image-label, and label-label correlations. Open issues remain, such as class imbalance and weak labeling of the training dataset [52]. Another problem is that existing methods perform annotation and ranking simultaneously, which is not ideal for efficient image retrieval. Images also need to be ranked within each category produced by a single labeling approach to achieve better retrieval accuracy [54]. Finally, some image annotation systems tend to achieve high precision and low recall; a strong image annotation method should balance precision and recall by improving recall without sacrificing precision. In addition, some image annotation models have long training times and high computational complexity, making them computationally intensive when faced with large training datasets.
It is necessary to highlight the need for careful consideration of these aspects when building new image annotation techniques and datasets for future tasks [53].