DeepAIA: An Automatic Image Annotation Model Based on Generative Adversarial Networks and Transfer Learning

Automatic image annotation (AIA) has been adopted in different applications such as image retrieval and classification. Deep learning is used in AIA to extract image features and then convert these features into text descriptions and labels. However, conventional AIA models that employ deep learning methods suffer from various shortcomings, such as poor annotation performance. This work proposes an AIA model based on convolutional neural networks (CNNs), generative adversarial networks (GANs), and transfer learning. GANs have attracted a lot of interest because of their ability to generate data without explicitly using a probability density. Thus, they have proven their usefulness in image annotation and image augmentation. In this work, an auxiliary classifier GAN (ACGAN) is used, where the discriminator predicts the class of an image rather than taking it as a given input; therefore, the training stage is stabilized and high-quality images are generated. Transfer learning is also used to enhance the performance of the classification. The proposed model outperforms the best state-of-the-art models in terms of MiAP, F-measure and error rate on the ImageClef, ESPGame and IAPR-TC12 datasets.


I. INTRODUCTION
Along with the growth of the internet, the proliferation of and easy access to image capture devices such as smartphones, cameras, and drones have raised the number of images on the web tremendously. Moreover, various social media platforms, such as Snapchat, Twitter, and Instagram, have become popular and allow users to freely share image-based content. For instance, about 100 million images are uploaded daily to Instagram [1]. Most of the images contain valuable information for businesses and organizations, from consumer interest in a product or fashion event to patients' X-rays in the medical field. A good image retrieval technique can facilitate many real-life tasks and have a large impact on many levels [2]. To retrieve images that satisfy the demand of users, it is important to label images correctly and precisely.
The associate editor coordinating the review of this manuscript and approving it for publication was Mohamed Elhoseny.
To accomplish that, one of the best methods to manage large-scale image datasets is automatic image annotation (AIA) [3].
AIA aims at assigning annotations/labels to images that describe their visual content. AIA is divided into three approaches: text-based annotation, content-based annotation, and multimodal-based annotation. The first approach is text-based image annotation [4], in which images are annotated based on the text assigned by users or the text that surrounds the images on web pages. However, this process is now impractical due to the massive number of images on the web, and two individuals may annotate the same image inconsistently. The second approach is content-based image annotation, which concentrates on low-level visual features such as color, shape, and texture in the process of annotating images. This approach suffers from a well-known issue called the semantic gap [5]. The third approach is multimodal-based image annotation. This approach leverages both of the earlier two approaches to solve the problem of the semantic gap. Multimodal-based models showed better results compared to the other approaches in image annotation. Deep learning-based (DL) models have recently shown significant development in the AIA task, particularly with large-scale data. They can efficiently work with large datasets and learn feature representations automatically; thus, handcrafted features became unnecessary. In the field of computer vision, one of the prominent methods is the convolutional neural network (CNN) [6], [7], whose structure is primarily based on multiple neural layers. The key to its success lies in the complex neural architecture that is capable of taking into account the global and the local characteristics of the input. Basically, a CNN extracts rich hierarchical features from the image and produces probabilities of different possible labels.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Recently, another deep learning method that has had huge success in the field of computer vision, especially in image/video generation, is the generative adversarial network (GAN) [8]. The GAN architecture was inspired by the two-player zero-sum game, in which the players' cumulative payoff is zero and the gain or loss of utility of each player is balanced by that of the other. A GAN basically consists of two neural networks, namely the generative network and the discriminative network, which compete with each other, as shown in Figure 1. GANs succeed in different challenging tasks such as generating animation, video frames, and images [9]. In this paper, an improved AIA model called DeepAIA is proposed. It relies on image augmentation using GAN and CNN. The CNN network extracts the visual features from images through a pre-trained deep CNN architecture. The CNN architecture is trained via a transfer learning technique to save time and overcome the overfitting problem. The GAN network is another main component of the DeepAIA model; it acts as a powerful data augmentation technique that enhances the training of the CNN of the proposed model in an unsupervised manner. The GAN network efficiently tackles the problems resulting from overfitting and training with small-sized datasets in CNN architectures. Specifically, this work adopts a pure data-augmentation method based on ACGAN to artificially synthesize new annotated image samples, which addresses the image augmentation problem of small-scale datasets. The proposed DeepAIA model has been evaluated using four large datasets. To the best of our knowledge, two of the datasets (ImageClef 2011 and ImageClef 2012) have never been used in the evaluation of deep learning-based models.
The core contributions of this paper are summarized as follows:
• A comprehensive review of existing AIA models, highlighting the strengths and weaknesses of each.
• Proposing an AIA model named DeepAIA to automatically annotate images with multiple labels. This model combines the functionalities of a pre-trained model using different architectures and image generation using ACGAN; therefore, the problems resulting from overfitting or training with small-sized datasets are addressed.
• Testing the proposed DeepAIA model on the ImageClef 2011 and ImageClef 2012 datasets, which is the first time that they are used to test AIA models with a CNN architecture.
• Verifying the effectiveness of the proposed ACGAN-CNN model by comparing its performance with state-of-the-art models on four datasets: ImageClef 2011, ImageClef 2012, ESPGame, and IAPR-TC12.
The rest of this paper is organized as follows: Section II discusses the existing works conducted in the AIA field. Section III introduces the proposed DeepAIA model by explaining the details of its main stages. The implementation and experiment details are presented in Section IV. Section V discusses the experimental results. Section VI concludes this work and presents some possible future directions.

II. RELATED WORK
With the assistance of the training set, AIA models are capable of learning the relationship between the visual content of the image and high-level image semantics. According to the training approach, deep learning annotation methods can be categorized into two categories, namely: training from scratch-based annotation, and transfer learning-based annotation.

A. TRAINING FROM SCRATCH-BASED ANNOTATION
Most annotation methods train their model from scratch on their own datasets, thus keeping the configuration of the model under control. The AIA model proposed in [11] relies on CNN and neighbor groups to annotate images using a k-nearest neighbor (KNN) model that clusters similar visuals into groups. In the testing phase, the features extracted from a new image are compared to the KNN model to find similar features. Then, a self-defined Bayesian model is used to assign the tags related to the neighbor set to the new image. However, the model is influenced by the size of the training set.
The last decade has brought significant development in the field of deep learning techniques that sufficiently tackle AIA tasks. Firstly, a model can be trained in a semi-supervised way, where the training images are not fully labeled. Wu et al. presented a model based on deep CNN to annotate images in a semi-supervised manner [12]. Images are sampled from the training image set that contains labeled and unlabelled images. These images are fed into three CNNs that share the same architecture and the same weights. Then, the learned feature representations are considered as activations in the ranking layer, while the Weighted Pairwise Ranking Loss (W2PR) layer takes the output of the ranking layer and performs the classification.
Secondly, a model can be trained in a supervised manner, where all the training images are fully labeled. Kiyokawa et al. proposed a fully automated annotation model based on CNN [13]. It uses a single visual marker, along with noise masking to hide the marker, to label images collected manually in automated factories. The labels for each object are identified using the IDs of the detected marker. However, in cases where the products are close to each other, the single marker method will fail to detect the product.
There are some hybrid-based deep learning methods that leverage different deep learning-based architectures. Feng et al. proposed a hybrid-based model to automatically annotate images [14]. They combined a CNN architecture to model the images and long short-term memory (LSTM) to model the user's tags, which will be concatenated using a multi-layer perceptron (MLP). In the end, a class distribution is produced in the SoftMax layer to predict the labels. In spite of these results, this model has considered the image annotation problem as an image classification problem.
With the intention of improving the training process and tackling the overfitting issue, the most common method is to increase the size of the dataset. Wang et al. presented a multitask voting model based on data augmentation that improves the accuracy of annotation [15]. The proposed model adopts a CNN architecture along with an adaptive label to achieve the best number of labels using SoftMax. The authors proved that traditional data augmentation methods are not practical, as important parts of the image are sometimes lost. Thus, another deep learning data augmentation method, called GAN, was presented recently; it is a powerful technique to generate new images in an unsupervised manner.
Adar et al. proposed a deep learning annotation model using GAN [16] for the medical field, where collecting a lot of data is a challenge. The model is based on deep CNN as a classification approach for liver lesion datasets. Adopting GAN in the model significantly improves the accuracy of CNN annotation, achieving an improvement of 7% over traditional data augmentation. Also, medical images sometimes are unlabelled, and the others are annotated at the image level. Ke et al. proposed an end-to-end AIA model based on deep CNN (E2E-DCNN) [17] that deals with feature learning and annotation in an end-to-end manner through the CNN method. They adopt the GAN method to enhance the annotation performance. The earlier mentioned attempts train the model from scratch based on the available datasets. It is known that training from scratch is time-consuming compared to a transfer learning approach, which could save a lot of time and effort.

B. TRANSFER LEARNING-BASED ANNOTATION
Transfer learning has received considerable interest in the field of deep learning models and has proven to save a lot of training time. It also assists in AIA tasks such as multi-class and multi-label classification.
Raghu et al. proposed a multi-class deep learning-based model to classify types of seizures with a non-seizure electroencephalogram (EEG) by adopting deep CNN and following the transfer learning approach in training the CNN architecture [18]. Recognition of seizure types is crucial for the neurosurgeon to understand the cortical connectivity of the brain. Baltruschat et al. proposed a multi-label deep learning-based model to label diseases on chest X-ray images [19]. The proposed model adopted a CNN architecture under the transfer learning approach. They also considered non-image information such as gender and age along with the image information to train the model. They proved that integrating patient information is useful and enhances classification.

III. DEEP AUTOMATIC IMAGE ANNOTATION MODEL
The proposed DeepAIA model is an end-to-end AIA model that integrates a data augmentation method (i.e., an auxiliary classifier GAN (ACGAN)) with a CNN classifier, as shown in Figure 2. The DeepAIA model consists of three main stages, namely: (i) data preparation, (ii) training, and (iii) testing. The following sections discuss the main stages of the proposed DeepAIA model in detail:

A. DATASET PREPARATION
In any machine learning method, the process of data preparation and transformation can have a significant impact on the success of the method and can facilitate the process of learning. Since the proposed model is based on deep learning methods (a deep GAN and a deep CNN), the input of these methods should be in a specific format [20]. The output of this stage is a pre-processed dataset, which is used as input for the next stage (i.e., the training stage).

B. TRAINING
The goal of the training stage is to train the classifier efficiently. It consists of three main phases, namely: (i) synthetic image augmentation, (ii) transfer learning, and (iii) training validation and parameters fine-tuning, which are discussed in detail in the following subsections.

1) SYNTHETIC IMAGE AUGMENTATION
In this research, synthetic image augmentation is used to augment the training set of the selected datasets to further strengthen the learning process of the deep CNN architecture and mitigate the problems associated with training using small-sized datasets. A deep ACGAN image augmentation network is adopted in this research because it has been shown to provide more stable training and higher-quality synthesized images compared to previous GAN architectures.
Basically, the GAN framework comprises the generator network G and the discriminator network D [21]. G aims at generating fake images $X_{fake} = G(z)$ from a noise vector $z$; this noise vector is of a fixed length and is drawn randomly from a Gaussian distribution in the latent space. On the other hand, the goal of D is to differentiate the real image from the synthetic image through the probability $P(O|X) = D(X)$, where $O$ denotes the origin of the image. Figure 3 illustrates the architecture of DeepAIA. First, images with N labels from the dataset are considered real images, which serve as input to the ACGAN part of the model.
The generator is basically a deconvolution network that takes random noise along with the class label as input to generate fake images. Then a mini-batch of real images along with a mini-batch of generated fake images is taken as input to the discriminator, which determines the authenticity of each image (whether it is synthesized or not) and also reconstructs the class label of the image by using an auxiliary decoder [22] to stabilize the network. Basically, the discriminator consists of a set of second convolutional layers backed up with a Leaky ReLU non-linearity [23]. If the input image is classified as fake, the discriminator calculates feedback to update the network so that it gets better at discriminating in the next round. The same holds for the generator, which is updated based on how well the synthesized samples fooled the discriminator so that it generates more realistic images in the next round. The ACGAN loops through E epochs in this manner. Finally, the generator outputs synthesized images classified as real by the discriminator, which are combined with the original images in the training set of the dataset to train the CNN through transfer learning in the next step of the model.
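The adversarial update described above can be sketched in a few lines. This is a minimal illustration that assumes the standard ACGAN objective, with a source log-likelihood L_S (real versus fake) and a class log-likelihood L_C (label reconstruction); it is not the training code of DeepAIA. The function names, the latent size, and the probabilities passed in are hypothetical.

```python
import math
import random


def sample_generator_input(latent_size, n_classes, rng):
    """Draw a fixed-length Gaussian noise vector z and a random class label."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]
    label = rng.randrange(n_classes)
    return z, label


def acgan_losses(p_src_real, p_src_fake, p_cls_real, p_cls_fake, eps=1e-12):
    """Return (discriminator_loss, generator_loss) for one mini-batch.

    p_src_real: D's probability of "real" for each real image
    p_src_fake: D's probability of "real" for each generated image
    p_cls_real, p_cls_fake: probability D assigns to the correct class label
    """
    def mean_log(probs):
        return sum(math.log(p + eps) for p in probs) / len(probs)

    # L_S: log-likelihood of the correct source (real vs. fake)
    l_s = mean_log(p_src_real) + mean_log([1.0 - p for p in p_src_fake])
    # L_C: log-likelihood of the correct class label
    l_c = mean_log(p_cls_real) + mean_log(p_cls_fake)
    # D maximizes L_S + L_C and G maximizes L_C - L_S;
    # both are negated here so that they read as losses to minimize.
    return -(l_s + l_c), -(l_c - l_s)
```

When the discriminator cannot tell real from fake (all source probabilities equal to 0.5), its loss is positive and the generator's loss is negative, which reflects the zero-sum nature of the source term.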

2) TRANSFER LEARNING
Transfer learning is a method of machine learning in which a model trained on a specific task is reused for a second task. The transfer learning technique has the advantage of reducing the training time and can result in less generalization error [18], [19]. Basically, when removing the classifier layer of a pre-trained CNN architecture, the model will take an image as input and output feature maps. Then, a new classifier layer will take the feature maps as input and learn the new task of annotating images with new labels.
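The mechanism of replacing the classifier layer can be illustrated with a deliberately small, framework-free toy. In this sketch, `frozen_features` is a hypothetical stand-in for a pre-trained CNN with its classifier removed (its parameters are never updated), and only the new sigmoid classifier layer is trained; the feature function, learning rate, and epoch count are assumptions chosen for illustration only.

```python
import math
import random


def frozen_features(x):
    """Stand-in for a pre-trained backbone: maps an input to a feature vector.

    Its parameters are fixed and never updated during training.
    """
    return [math.tanh(x[0] + x[1]), math.tanh(x[0] - x[1])]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def train_head(samples, labels, epochs=200, lr=0.5, seed=0):
    """Train only the new classifier layer (weights w, bias b)."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(2)]
    b = 0.0
    # The backbone is only used to extract features; it is not trained.
    feats = [frozen_features(x) for x in samples]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            g = p - y  # gradient of binary cross-entropy w.r.t. the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def predict(x, w, b):
    """Score a new input with the frozen backbone and the trained head."""
    f = frozen_features(x)
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
```

In a real setting the backbone would be one of the pre-trained architectures and the head a dense layer, but the division of labor is the same: features come from the reused model, and only the new layer learns the new labels.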
As illustrated in Figure 3, the input to the CNN classifier will be the training set from the dataset along with synthesized image augmentation from ACGAN. Thus, the training of the new classifier of CNN will be empowered to a great extent, where the training set carries data points and noise. By leveraging the merits of transfer learning and ACGAN augmentation, the CNN classification is enhanced.
The main goal of synthetic image augmentation is to synthetically increase the number of samples used in training to improve the performance of the CNN [24]. The GAN mainly grasps the inherent distribution of the data from a set of examples in order to produce synthetic images from the learned distribution. Then, once the distribution of each label is learned, the synthesized images are produced using normally distributed noise as an input vector [24].

3) TRAINING, VALIDATION AND PARAMETERS FINE-TUNING
A CNN has a number of parameters to be set, and tuning them manually is prone to configuration errors. Besides, deep learning networks may face an overfitting problem, which means the network starts memorizing the training dataset. Overfitting therefore degrades the generalization performance of the network [25]. One of the well-known techniques to mitigate the risk of overfitting in CNN architectures is K-fold cross-validation [26]. In addition, K-fold cross-validation is used to evaluate the deep learning model while the model is in the process of parameter adjustment. It involves splitting the training data into K partitions (usually K=4) [27].
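The K-fold split can be sketched as follows. This is a minimal illustration of the partitioning step only, assuming contiguous, unshuffled folds; the function name `k_fold_indices` is hypothetical and the default K=4 follows the value mentioned above.

```python
def k_fold_indices(n_samples, k=4):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size
```

Each sample is used for validation exactly once and for training in the remaining K-1 rounds, so the validation scores can be averaged while the parameters are being adjusted.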
The testing stage aims to test the trained classifier using unseen images. The output of ACGAN is the generated image dataset that is compatible with the shape and size of the original dataset. All images (original dataset and generated dataset) are used to train the CNN classifier. After the fully connected layer in the CNN classifier, the classification (i.e., annotation) is executed using the activation function based on the ground-truth set. Last, each image included in the testing set is labeled with one or more labels (i.e., classes) based on the class score, as shown in Figure 4. As mentioned earlier, the proposed DeepAIA aims to annotate images with multiple labels. Thus, the densely connected layer is adopted with the sigmoid activation function, which is an activation function for multi-label classification problems similar to the problem targeted in this research. The sigmoid function has a faster variance rate [28]; it has also been adopted in this study because its output lies in the range [0, 1].
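The multi-label decision can be sketched as below. Each label receives an independent sigmoid score, and every label whose score reaches a threshold is assigned to the image. The threshold of 0.5 and the function name `annotate` are assumptions made for illustration; the text only states that labels are assigned based on the class score.

```python
import math


def sigmoid(v):
    """Map a real-valued score to the range [0, 1]."""
    return 1.0 / (1.0 + math.exp(-v))


def annotate(logits, label_names, threshold=0.5):
    """Return every label whose sigmoid score reaches the threshold."""
    scores = {name: sigmoid(z) for name, z in zip(label_names, logits)}
    return [name for name, score in scores.items() if score >= threshold]
```

Unlike a SoftMax output, the sigmoid scores do not have to sum to one, so several labels can be selected for the same image.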

IV. EXPERIMENTS
Throughout this section, we describe the details of the experiments with the proposed DeepAIA on four public annotation datasets. Further, we describe the experimental settings, which include datasets, environment, parameters, and evaluation metrics. We also show the experimental results and analyses of the DeepAIA model, and then compare the DeepAIA model with previous models in the field of AIA.

A. DATASETS
The proposed model is evaluated on four common, public, and large datasets from the field of image annotation: ImageClef 2011 [29], ImageClef 2012 [30], ESPGame [31], and IAPR-TC12 [32]. The images in these datasets are of different categories such as color, weather, vehicles, natural, etc., making the annotation a difficult task. The datasets with their ground truth are available at (https://zenodo.org/record/5570889#.YWoC3EZBw1I). The summary information of the four datasets is shown in Table 1.
The selected datasets are prepared in order to be used as an input for DeepAIA in the required format. Since images are of different sizes, all images in the datasets have been uniformly resized, where MobileNet and ResNet-101 require a 224 × 224 image input shape, while Inception requires a 299 × 299 image input shape.

B. ENVIRONMENT AND PARAMETERS
The environment employed in the experiments was as follows: Windows 10 64-bit operating system, x64-based processor, Intel(R) Core i7 CPU @ 2.80 GHz, 16 GB memory. The primary programming language used to build and execute the experiments was Python.
All deep learning-based models contain a number of parameters, called hyperparameters, that are not learned from the training set and that influence the accuracy of the model. The ACGAN framework is associated with a set of hyperparameters that can affect the accuracy of the resulting augmentation. There are works that experiment with tuning these hyperparameters on a problem similar to ours. Thus, the batch size and latent size values were taken from [22]. In addition, the Adam learning rate and Adam beta values were taken from [33], along with the LeakyReLU non-linearity on the discriminator. The CNN architecture is also associated with a set of hyperparameters that can affect the accuracy of the resulting annotation. The learning rate is set to 0.001 and the batch size is set to 128 [34] on all selected datasets.

C. EVALUATION METRICS
To determine the effectiveness of the proposed model, it is evaluated with the most commonly used metrics in image annotation, including:

1) MEAN INTERPOLATED AVERAGE PRECISION (MiAP)
MiAP is a metric used to evaluate the performance per label [35]. MiAP is computed based on equation 1.
where the MiAP is obtained by interpolating the precision only at the 11 recall levels $r$, taking the maximum precision whose recall value is greater than $r$, as illustrated in equations 2 and 3.
where $n$ denotes the total number of labels, $AP_{interp}$ denotes the average interpolated precision, $P_{interp}$ denotes the interpolated precision, and $p(r)$ is the precision at recall $r$.
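For reference, a standard formulation of 11-point interpolated average precision that is consistent with the definitions above can be written as follows. This is a sketch under that assumption: the recall levels $r \in \{0, 0.1, \ldots, 1.0\}$, the per-label index $i$, and the use of $\ge$ in the maximum follow the usual convention rather than anything stated explicitly in the text.

```latex
\begin{align}
\mathrm{MiAP} &= \frac{1}{n} \sum_{i=1}^{n} AP_{interp}(i) \tag{1} \\
AP_{interp} &= \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1.0\}} P_{interp}(r) \tag{2} \\
P_{interp}(r) &= \max_{\tilde{r} \ge r} p(\tilde{r}) \tag{3}
\end{align}
```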

2) PRECISION (P)
Precision is a metric measuring the fraction of positive predictions that are made correctly [36], as in equation 4, where TP indicates the true positives and FP indicates the false positives.

3) RECALL (R)
Recall is a metric that quantifies the fraction of correct positive predictions out of all positive predictions that could have been made [36], as in equation 5, where FN indicates the false negatives.

4) F1
It is the harmonic mean of the precision and recall. It is used to evaluate the performance per image, as shown in equation 6.
where P denotes precision as in equation 4, and R denotes recall as in equation 5.
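In their standard form, and using the symbols TP, FP, and FN defined in the preceding subsections, equations 4 to 6 can be written as below. The equation numbers follow the references in the text; the formulas themselves are the usual textbook definitions and are assumed to match the ones intended here.

```latex
\begin{align}
P &= \frac{TP}{TP + FP} \tag{4} \\
R &= \frac{TP}{TP + FN} \tag{5} \\
F1 &= \frac{2 \cdot P \cdot R}{P + R} \tag{6}
\end{align}
```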

5) AREA UNDER THE CURVE (AUC)
AUC is used in the ImageClef annotation task to evaluate the performance of the annotation model per label. The AUC is the area under the curve obtained by plotting the recall values against the false positive rate [37].
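A minimal sketch of this computation for a single label is given below, assuming the curve is traced by sweeping a decision threshold over the predicted scores and the area is obtained with the trapezoidal rule. The function names are hypothetical, and the sketch requires at least one positive and one negative example.

```python
def roc_points(scores, labels):
    """Return (false positive rate, recall) points for decreasing thresholds."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A model that ranks every positive example above every negative one reaches an AUC of 1, while each positive ranked below a negative lowers the area.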

6) EQUAL ERROR RATE (EER)
It is also used in the ImageClef annotation task to evaluate the performance of the annotation model per label. The EER is the error rate at the point where the false positive rate (FPR) and the false negative rate (FNR) intersect [38]. FPR and FNR are calculated using equations 7 and 8, respectively.
where v denotes the total number of negative elements, FP denotes the total number of misannotated positive elements, and FN denotes the total number of misannotated negative elements.
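Under the usual definitions, equations 7 and 8 can be written as below. The symbol $v$ is taken from the text; the expansion of $v$ as $FP + TN$ and the denominator $FN + TP$ for the positive elements are assumptions based on the standard confusion-matrix definitions, since the text does not name a symbol for the number of positive elements. The EER is then the common value of FPR and FNR at the decision threshold where the two are equal.

```latex
\begin{align}
FPR &= \frac{FP}{v} = \frac{FP}{FP + TN} \tag{7} \\
FNR &= \frac{FN}{FN + TP} \tag{8}
\end{align}
```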

V. RESULTS AND DISCUSSION
The proposed DeepAIA model was implemented using transfer learning of three deep CNN architectures (MobileNet, ResNet-101, Inception), as mentioned earlier. After training and testing DeepAIA on the four selected datasets, the results were as follows: the best-scored values were achieved with the Inception CNN architecture for both ImageClef 2011 and ImageClef 2012, as illustrated in Table 2 with the selected evaluation metrics, strictly following the testing guidelines of the ImageClef datasets, while the best-scored values were achieved with the MobileNet and Inception CNN architectures for the ESPGame and IAPR-TC12 datasets respectively, as illustrated in Table 3.

A. COMPARISON AGAINST THE STATE-OF-THE-ART MODELS
To demonstrate the improvement that DeepAIA achieved, a comparison against the state-of-the-art models was conducted. For ImageClef 2011, the proposed DeepAIA model achieved results that outperform all the state-of-the-art models considered in Table 4 in terms of MiAP, F-measure, and AUC. The proposed DeepAIA model outperforms even the best-scored results achieved by the multimodal model proposed in [39].
For ImageClef 2012, the proposed DeepAIA model achieved results that outperform all the state-of-the-art models considered in Table 4 in terms of MiAP, which was the only metric publicly available for all models considered in the comparison. The proposed DeepAIA model outperforms the best-scored result achieved by the multimodal model proposed in [40].
For the ESPGame and IAPR-TC12 datasets, the evaluation metrics used in the experiments were precision, recall, and F-measure. The results achieved with the proposed model are compared to the results of the state-of-the-art models in the annotation task. The proposed DeepAIA model achieved results that outperform all the state-of-the-art models considered in Table 5. The proposed DeepAIA model outperforms the best-scored results achieved by the E2E-DCNN model proposed in [17] on both the ESPGame and IAPR-TC12 datasets.

B. DISCUSSION
The experimental results illustrate DeepAIA's capability of performing multi-label annotation for given images. The model outperformed the other studies on all the datasets, as shown in Tables 4 and 5.
In the ImageClef 2011 dataset, the DeepAIA model, compared to the best-scored model [39], achieved a gain of 5.61% in terms of MiAP, a gain of 16.15% in terms of F-measure, and a 69.8% reduction in terms of EER. However, for AUC the best-scored result was achieved by the "LIRIS" model [44], while DeepAIA achieved a result near the best-scored result and higher than most of the other models. In the ImageClef 2012 dataset, the proposed DeepAIA model achieved a gain of 30.05% in terms of MiAP compared to the best-scored model [39].
For the ESPGame dataset, compared to the best-scored model, the proposed DeepAIA model achieved a gain of 130% in terms of F-measure. Moreover, for the IAPR-TC12 dataset, compared to the best-scored model, the proposed DeepAIA model achieved a gain of 119% in terms of F-measure.

VI. CONCLUSION
With the explosive growth of digital images, the need to describe images at a semantic level to facilitate indexing and arranging large-scale image collections has increased. Thus, the automatic interpretation and uncovering of important information included in an image by annotating it with accurate labels is the main task of AIA. Likewise, accurate retrieval of images on demand is one among many real-life AIA applications. Consequently, different approaches, varying from statistical methods to recent deep learning (DL) methods, have been used to get the best possible performance on all kinds of datasets.
In this work, a DeepAIA model capable of automatically annotating large-scale images was proposed. The framework of the DeepAIA model adopts a well-known technique in image classification and annotation called CNN. In this research, the CNN was trained through transfer learning of various pre-trained CNN architectures, which have proven to contribute to increasing the model performance. Also, the benefit of data augmentation through a well-known technique called GAN was exploited, where data augmentation enriches the training set for better learning of the CNN architecture. The results of testing DeepAIA on four different datasets were reported. In comparison with other models, DeepAIA outperformed state-of-the-art results on the four selected datasets.
A possible direction of future research is to further experiment with different kinds of GAN architectures for data augmentation, as well as with different pre-trained CNN architectures such as GoogLeNet, LeNet, and Xception. On a more advanced level, the AIA task that DeepAIA solved can go beyond the simple annotation of image content to identify complex types. For instance, the task could be the identification and classification of weather conditions such as hot, cold, and humid, or emotional states such as happy, sad, or confused. Besides, the task of DeepAIA can be extended to live annotation of videos.