Automatic Image Annotation by Sequentially Learning From Multi-Level Semantic Neighborhoods

Automatic image annotation (AIA) is a key technology in image understanding and pattern recognition, and it is becoming increasingly important for annotating large-scale image collections. Over the past decade, nearest neighbor based AIA methods have proved to be the most successful of the classical models. These methods face four major challenges: the semantic gap, label-imbalance, wider range labels, and weak-labeling. In this paper, we propose a novel annotation model based on three-pass KNN (k-Nearest Neighbor) to address these challenges. The key idea is to identify appropriate neighbors at each KNN pass. In the first pass, we identify the several most relevant categories based on label features rather than on visual features as in traditional models. In the second pass, we determine the relevant images based on multi-modal (visual and textual label) embedding features. As the test image has not been annotated with any label, we propose a pre-annotation strategy, applied before image annotation, to raise the semantic level. In the third pass, we capture relevant labels from semantically and visually similar images and propagate them to the given unlabeled image. In contrast with traditional nearest neighbor based methods, our method can inherently alleviate the problems of the semantic gap, label-imbalance, and wider range labels. In addition, to alleviate the issue of weak-labeling, we propose label refinement for training images. Extensive experiments on three classical benchmark datasets and MS-COCO demonstrate that the proposed method significantly outperforms the state-of-the-art in terms of per-label and per-image metrics.


I. INTRODUCTION
With the prevalence of digital photography and social networks in our daily lives, billions of images are generated and shared on the Internet. Users have access to a flood of images, making it a challenge to retrieve and manage the ones they care about from this vast ocean of available visual data [1]. Automatic image annotation (AIA) is the task of automatically assigning several textual labels to a given image based on its semantics. Recently, AIA has been an active research topic in the fields of computer vision and machine learning due to its great potential applications in image retrieval, image classification, image understanding, and image management [2]-[4].
The associate editor coordinating the review of this manuscript and approving it for publication was Long Wang.
In the past 20 years, a considerable amount of research effort has been devoted to devising automatic image annotation models. Representative AIA approaches such as MBRM [5], JEC [6], 2PKNN [2], and D²IA [7] have been proposed, and great achievements have been made. Recently, significant advances have been achieved on large-scale image recognition tasks [8] with deep learning models such as the Convolutional Neural Network (CNN) and the Generative Adversarial Network (GAN). In comparison with image recognition, image annotation is a more challenging task, since it is a multi-label multi-class classification problem [8] rather than a single-label multi-class classification problem. The four most difficult challenges of AIA are the semantic gap, label-imbalance, weak-labeling, and wider range labels [2]. The semantic gap is the difference between the low-level image features represented by machines and the high-level concepts that humans perceive in the image. The label-imbalance problem means that there is a high variance in the number of images corresponding to different labels; this problem is quite common when the dataset or the label vocabulary is large, and it remains an open issue. Weak-labeling means that manual annotations are noisy, irrelevant, or incomplete, which in turn leads to poor annotation. Wider range labels means that labels in image annotation can refer to a much wider and more diverse range of concepts or to drastically different levels of abstraction, such as concrete visual objects (cat, train), scenes (beach, city), amorphous background elements (sky, grass), or abstract concepts (scary, serene) [1], [9]. The traditional CNNs, designed for single-label image classification, are thus unsuitable for the image annotation task because they fail to provide rich representations at different abstraction scales.
During the past decade, there have been several efforts to address these issues, such as MBRM, JEC, CNN+WARP [10], LTN [11], TagProp [12], 2PKNN, CCA-KNN, D²IA, and CNN-RNN [13]. Benefitting from deep learning features (such as CNN and GAN), most deep learning based models can reduce the semantic gap, although the issue has not been resolved thoroughly. In single-label image classification, deep learning features are able to address the issue of the semantic gap, and CNNs have shown great performance as general feature representations for object recognition applications. However, for multi-label images that contain multiple objects of different categories, scales, and locations, global CNN features are not optimal. Some of the more successful models (TagProp, KCCA-2PKNN, SKL-CRM, SVM-DMBRM [14], and 2PKNN) address the semantic gap problem with computationally expensive metric learning. Some nearest neighbor based models, e.g. 2PKNN, alleviate label-imbalance by paying more attention to rare labels, which improves the performance on low-frequency words at the expense of high-frequency words. Some studies attempt to provide multi-level deep features as high-quality image features suitable for image annotation [9], [15]. Nevertheless, the most important challenge, i.e. the weak-labeling problem, has never been tackled explicitly. In summary, AIA is still a difficult and challenging task [2].
In this paper, we propose a novel image annotation method based on nearest neighbors to address the problems mentioned above. Unlike the existing nearest neighbor based methods, our method sequentially learns from multi-level semantic neighborhoods rather than from a single neighborhood. First, we complete the labels of the training dataset by propagating labels from neighbors, in order to overcome the problem of weak-labeling. Second, we divide the training images that have similar textual label features into semantic neighbor groups by the N-cut algorithm. Each group is considered as a category. Third, we pick the top N similar images from the three most similar categories according to their similarities in the KCCA space, which is trained on the visual features and the completed labels of the training image dataset. The selected neighbors form a category-level similar neighborhood. Fourth, we pick the top M similar images from the category-level similar neighborhood according to visual similarity, and propagate labels from the category-level and object-level similar neighbors by using their similarities as weights.
Our main contributions are as follows:
(1) We propose a novel annotation method based on three-pass KNN, which can accurately capture the relevant categories, relevant images, and appropriate labels for a given test image.
(2) We propose multi-level semantic neighborhoods that provide rich representations at different abstraction scales, suitable for image annotation tasks.
(3) We propose a pre-annotation strategy for the unlabeled test image to perform multi-modal image retrieval so as to reduce the semantic gap.
(4) We propose label refinement for training images based on their textual label similarities to alleviate the issue of weak-labeling.
(5) Our proposed model can alleviate the problem of label-imbalance by enhancing rare labels without sacrificing frequent labels.
Extensive experiments are carried out on three benchmark datasets (Corel5k, ESP Game, and IAPR TC-12) and on MS-COCO. The experimental results demonstrate that our model outperforms the state-of-the-art alternatives.

II. RELATED WORKS
A large number of automatic image annotation models have been developed. In this section, we briefly review the closely related and representative works.

A. GENERATIVE MODELS
The generative model-based AIA methods were quite popular in the early 21st century, and great achievements were made. In the model training stage, the generative model aims at learning a joint distribution between visual and label features so that the learned model can predict the conditional probability of labels given the features of a test image. The generative models focus on maximizing the generative likelihood of image features and labels. The generative models used for AIA mainly consist of relevance models, mixture models, and topic models. The representative models include CMRM, CRM, MBRM, and PLSA-WORDS. These models are usually expensive or require simplifying assumptions that can be suboptimal for predictive performance [16].

B. DISCRIMINATIVE MODELS
Discriminative models consider image annotation as a multi-label multi-class classification problem. These models regard each label as an independent class. A separate classifier is trained for each label with the visual features of training images, and the trained classifier can then predict particular labels for a given test image. Most discriminative models are based on the support vector machine (SVM) or its variants [17]. The other representative models include SML [18], DMBRM, and SVM-DMBRM. The SML model treats image annotation as a multi-class classification problem and learns a class-specific distribution for each label [18]. The SVM-DMBRM model is a hybrid method that takes full advantage of the merits of both generative and discriminative models [14]. While SVM tries to solve the weak-labeling issue, DMBRM strives to solve the class-imbalance issue. However, these multi-label classification approaches do not scale to a large number of labels [17].
VOLUME 9, 2021

C. NEAREST NEIGHBOR BASED MODELS
The nearest neighbor based models have become increasingly widely used for AIA due to their effectiveness and simplicity. These models explore the visual similarities between a test image and training images, and finally assign labels to the test image by sorting the scores of neighbors of visually similar images [19]. The representative models based on nearest neighbor include JEC [6], TagProp [12], and 2PKNN [2].
The Joint Equal Contribution (JEC) model is one of the most classical nearest-neighbor models [6]. The JEC model utilizes various low-level image features and a simple combination of basic distance measures to find the nearest neighbors of a given image. It creates a family of very simple and intuitive baseline methods for image annotation [20]. Guillaumin et al. presented the TagProp [12] method, which learns the weight of each feature group and uses the label relevance prediction to annotate images [2]. The TagProp promotes rare labels and penalizes frequent ones by training a logistic model, which alleviates the class-imbalance problem. Its great achievement largely benefits from metric learning.
The 2-pass k-nearest neighbor (2PKNN) model represents a classical solution to the problems of label-imbalance and weak-labeling [2]. It is a two-pass variant of the traditional KNN. In the first pass, the image-label similarity is used, while the image-image similarity is used in the second pass. It identifies the k most similar semantic neighbor images for each label in the vocabulary. Because it successfully addresses the label-imbalance problem, 2PKNN has achieved strong results and is still one of the most influential image annotation approaches.
The conventional deep networks can be subjected to the decayed performance if we have insufficient training examples. Shu proposed weakly-shared Deep Transfer Networks (DTNs) to mitigate the problem by bringing in rich labels from the text domain [26]. The proposed model can translate cross-domain information from text to image. Tang proposed a novel generalized deep transfer networks (DTNs), capable of transferring label information across heterogeneous domains, textual domain to visual domain. The proposed framework has the ability to adequately mitigate the problem of insufficient training images by bringing in rich labels from the textual domain [27].
The CNN+WARP [10], proposed by Jia (the creator of Caffe), is the first attempt to leverage CNN features to solve the image annotation task. The CNN+WARP adopts a weighted approximate ranking loss function for training to promote image annotation performances. The VCCA [28], an image annotation method based on a deep multi-view learning model, extends linear CCA to nonlinear observation models parameterized by deep neural networks.
Different from these classical end-to-end deep learning features, Yu proposed to extract middle-level features from a deep learning model to accurately depict semantic concepts for image annotation [15]. This model can improve annotation performance, but at an expensive time and space cost. Recently, CNN-RNN (convolutional and recurrent neural network) encoder-decoder architectures have been adopted for image understanding, where the CNN subnetwork encodes the input pixels of images into visual features, and the RNN subnetwork decodes the visual features into a label prediction path [29], [30]. The CNN-RNN model uses output fusion to merge the CNN output features and the RNN output [29]. D²IA, an image annotation method based on the generative adversarial network (GAN) model, aims to create semantically relevant, yet distinct and diverse labels [7].
MangoNet [31] is a novel deep learning based image annotation model that combines co-attention mechanism and graph convolutional network (GCN). It explores image neighbors by measuring their metadata similarities and utilizes a graph network to model the correlations between the target image and its neighbors. To accurately capture the visual clues from the neighborhood, a co-attention mechanism is introduced to leverage the visual attention within the neighborhood. However, the GCN model increases the space and time complexities, which will be unfavorable to apply to large-scale databases.
Despite their relative success, most deep learning based models suffer from a single abstraction level. As is well known, the labels in AIA cover a much wider and more diverse range of visual objects and abstract concepts at different abstraction levels, while the labels in image recognition (classification) are always concrete visual objects at the same level. Most CNN models, designed originally for the image classification task, are unsuitable for the image annotation task because they fail to provide rich representations at different abstraction scales. As a result, CNN features alone can alleviate the problem of the semantic gap in AIA but cannot solve it thoroughly.

III. OUR PROPOSED METHOD

A. OUR FRAMEWORK
To resolve the problem of weak-labeling, the more successful models (such as TagProp and SVM-DMBRM) focus on metric learning, which is always computationally expensive. 2PKNN proposes a two-pass KNN solution to address the issue of label-imbalance by considering the image-label similarity and the image-image similarity in the two passes, respectively. Although 2PKNN significantly improves the per-label performance, it always sacrifices highly frequent labels. In fact, 2PKNN could not improve the annotation performance as shown by per-label evaluation metrics. Figure 1 shows our proposed framework for automatic image annotation. The proposed framework is composed of two main components, i.e. the training and testing processes. The training process includes label refinement, category generation, and KCCA model training with feature embedding.
To resolve the problems of weak-labeling and label-imbalance, we propose a novel image annotation method based on nearest neighbors. In contrast with traditional NN based models and the classical 2PKNN, which need computationally expensive metric learning, we propose a novel and simple label refinement to address weak-labeling. Rather than working in the traditional visual feature space, our proposed method refines labels for all training images in the label feature space, which can inherently address the problem of the semantic gap. Our proposed method divides the images that have similar features in the label feature space into semantic neighbor groups, each called a category. Our proposed method maps the visual feature vectors extracted by a deep learning architecture (pre-trained VGG-16) and the refined label vectors to a common feature space by the KCCA model. The new visual features and new label features are used in the test stage.
Similar to 2PKNN, our proposed method is a three-pass variant of the traditional KNN. Given a test image, in the first KNN, our method aims to find the $K_1$ most relevant categories based on label features. In the second KNN, we select the $K_2$ most similar images from each relevant category and combine them into a single neighborhood of relevant images. These relevant images are similar to the test image in both the embedding visual feature and the embedding label feature. In the third KNN, building on the previous two steps, we find the $K_3$ most visually similar images and propagate a fixed number of labels to the test image according to their original visual feature similarities.

B. LABEL REFINEMENT
To alleviate the shortcoming of weak-labeling, most methods devise sophisticated models with expensive time and space costs in the annotation process. Tang proposed a novel tri-clustered tensor completion framework to collaboratively explore these three kinds of information to improve the performance of social image tag refinement [32]. Tang proposed a novel Social anchor-Unit GrAph Regularized Tensor Completion (SUGAR-TC) method to efficiently refine the tags of social images, which is insensitive to the scale of data [33].
Our method supplements each training image with appropriate labels drawn from its neighbors' labels, and it does so directly in the label feature space. More specifically, the textual labels associated with an image $I_i$ can be considered as another modal feature of the image, so the textual feature vector of the image can be represented as follows:

$$t_i = (t_{i,1}, t_{i,2}, \dots, t_{i,M}) \tag{1}$$

where $M$ is the volume of labels (e.g. 260 for Corel5k), and $t_{i,k} = 1$ if the image $I_i$ has been manually annotated with the corresponding $k$-th label and 0 otherwise. For each training image, if the number of its initial labels is smaller than the target number ($M$), we can compensate several labels to the image by propagating labels from its neighbourhood in the label feature space. The similarity $sim_{text}(I_i, I_j)$ between the image $I_i$ and the image $I_j$ is based on the L2 distance between their label vectors (Equation 2). We choose $K$ neighbor training images in the textual label feature space for each current image $I_i$ and rank the labels for $I_i$ according to probability scores (Equation 3) that are weighted by the textual similarity $sim_{text}(I_i, I_j)$ of Equation 2. Based on probability theory, the probability of assigning a label $w_k$ to $I_i$ is defined as the posterior probability $P(w_k \mid I_i)$ (Equation 4), which involves $P(w_k)$, the probability of the label $w_k$. The best label for the image $I_i$ is given by:

$$w^{*} = \arg\max_{w_k} P(w_k \mid I_i) \tag{5}$$

We consider the top $M - |I_i|$ labels as refined labels for the current image $I_i$, where $|I_i|$ is the number of original manual labels associated with the image $I_i$. In addition, the posterior probability of a specific label $w_k$, computed by Equation 4, is taken as its confidence in the image's label vector. After label refinement, several zero elements of the textual label feature vector of the image (Equation 1) are replaced by the non-zero probability scores of the corresponding labels. As a result, the refined label vector is real-valued rather than discrete (binary) like the original label feature.
In the training stage, the proposed method divides the images that have similar refined label features into the same category by using the k-means algorithm. The center of a category $k$, denoted $C_k$, is defined as the mean of the label features of all images in this category.

Input:
  1) Q: the binary image-label matrix, Q ∈ B^{N×M}, where N is the number of training images and M is the number of labels in the image dataset.
  2) trainingImgSet: the training image dataset.

assign Q to NQ, NQ ∈ R^{N×M}, where NQ is a real-valued image-label matrix
for I_i in trainingImgSet do
    set simVector = ∅
    set neighborhood = ∅
    for I_j in trainingImgSet do
        compute sim = sim_text(I_i, I_j) with Equation (2)
        set simVector(j) = sim
    end for
    sort simVector in descending order
    assign the top K elements of simVector to simKNN, and their corresponding images to the neighborhood
    for k in [1, M] do
        compute P(w_k | I_i) with Equation (4)
    end for
    sort P(w_k | I_i) in descending order
    assign the largest (5 − |I_i|) probability scores of P(w_m | I_i) to NQ(i)
end for

Output: NQ ∈ R^{N×M}: the refined image-label matrix.
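As an illustration only, the label refinement procedure can be sketched in Python as follows. Several details are assumptions made for this sketch and are not stated in the text: the similarity is taken as 1 / (1 + L2 distance), an image is excluded from its own neighborhood, the label prior P(w_k) is the empirical label frequency, and the posterior is normalized to sum to one over the vocabulary. The function name `refine_labels` and its parameters are hypothetical.

```python
import numpy as np

def refine_labels(Q, k_neighbors=100, target=5):
    """Sketch of label refinement on a binary (N, M) image-label matrix Q.

    Returns a real-valued matrix NQ in which missing labels of each image
    are filled with posterior scores propagated from label-space neighbors.
    """
    Q = np.asarray(Q, dtype=float)
    n_images = Q.shape[0]
    NQ = Q.copy()
    prior = Q.mean(axis=0)                      # assumed prior P(w_k): label frequency

    # Pairwise L2 distances between binary label vectors.
    sq = (Q ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * Q @ Q.T, 0.0))
    sim = 1.0 / (1.0 + dist)                    # assumed form of the similarity
    np.fill_diagonal(sim, -np.inf)              # assumption: skip the image itself

    k = min(k_neighbors, n_images - 1)
    for i in range(n_images):
        n_missing = target - int(Q[i].sum())
        if n_missing <= 0:
            continue                            # image already has enough labels
        nbrs = np.argsort(-sim[i])[:k]          # K nearest neighbors in label space
        likelihood = sim[i, nbrs] @ Q[nbrs]     # similarity-weighted label votes
        post = likelihood * prior               # unnormalized posterior
        total = post.sum()
        if total <= 0:
            continue
        post = post / total
        post[Q[i] > 0] = 0.0                    # keep only labels not yet assigned
        top = np.argsort(-post)[:n_missing]
        top = top[post[top] > 0]
        NQ[i, top] = post[top]                  # real-valued confidence scores
    return NQ

Q_demo = np.array([[1, 1, 0, 0],
                   [1, 1, 1, 0],
                   [1, 1, 1, 0],
                   [0, 0, 0, 1]])
NQ_demo = refine_labels(Q_demo, k_neighbors=2, target=3)
```

In the demo, the first image gains a real-valued score for the third label from its two nearest neighbors, while images that already carry the target number of labels are left unchanged.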

C. FEATURE EXTRACTION AND REPRESENTATION
The feature vectors in our method differ from stage to stage. The original visual feature is extracted by a pre-trained architecture (VGG-16). We also consider the label information associated with the image as another modal feature of the image.
To promote the semantic level of the image feature vector, our model utilizes semantic embedding to properly map refined labels and visual features to a meaningful semantic space by using kernelized canonical correlation analysis (KCCA). In the training stage, the KCCA model can be learned from the original visual feature and the refined label feature. Then, the learned KCCA model can map the original visual feature and the textual label feature to a common meaningful semantic space, where the (embedding) new visual feature and the (embedding) new label feature can be generated.
Given the two views (visual modality and textual label modality) of the images, a common representation can be constructed by the KCCA model. KCCA uses images consisting of paired views to simultaneously find projections from each feature space such that the correlation between the projected representations is maximized. Given $N$ training pairs of visual and refined label features $\{(v_1, t_1), \dots, (v_i, t_i), \dots, (v_N, t_N)\}$, the idea is to simultaneously find directions $w_v$ and $w_t$ that maximize the correlation of the projections of $\phi_v$ onto $w_v$ and $\phi_t$ onto $w_t$ [16], [24]. The mappings $\phi_v$ and $\phi_t$ are realized through kernel functions:

$$K_v(v_i, v_j) = \phi_v(v_i)^{\top} \phi_v(v_j), \qquad K_t(t_i, t_j) = \phi_t(t_i)^{\top} \phi_t(t_j)$$

Thus, KCCA searches for solutions of $w_v$ and $w_t$ as linear combinations of the training data:

$$w_v = \sum_{i=1}^{N} \alpha_i \, \phi_v(v_i), \qquad w_t = \sum_{i=1}^{N} \beta_i \, \phi_t(t_i)$$

The objective of KCCA is thus to identify the weights $\alpha, \beta \in \mathbb{R}^{N}$ that maximize the objective function [24]:

$$(\alpha^{*}, \beta^{*}) = \arg\max_{\alpha, \beta} \frac{\alpha^{\top} K_v K_t \beta}{\sqrt{\alpha^{\top} K_v^{2} \alpha \;\; \beta^{\top} K_t^{2} \beta}}$$

where $K_v$ and $K_t$ denote the $N \times N$ visual and textual kernel matrices over a sample of $N$ pairs, respectively.
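To make the KCCA step more concrete, a minimal NumPy sketch is given below. It is not the implementation used in the experiments. The RBF kernel, the ridge-style regularizer `kappa`, kernel centering, and the SVD-based solution are all assumptions chosen for this sketch; the helper names `rbf_kernel`, `center_kernel`, and `kcca` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def center_kernel(K):
    """Center a square training kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(K_v, K_t, kappa=0.1, dim=1):
    """Regularized KCCA sketch: returns weights (alpha, beta) and singular values.

    With a = (K_v + kappa I) alpha and b = (K_t + kappa I) beta, maximizing
    alpha^T K_v K_t beta under unit-norm a and b reduces to an SVD.
    """
    n = K_v.shape[0]
    R_v = K_v + kappa * np.eye(n)
    R_t = K_t + kappa * np.eye(n)
    # (K + kappa I)^{-1} K commutes with K, so this equals R_v^{-1} K_v K_t R_t^{-1}.
    T = np.linalg.solve(R_v, K_v) @ np.linalg.solve(R_t, K_t)
    U, s, Vt = np.linalg.svd(T)
    alpha = np.linalg.solve(R_v, U[:, :dim])
    beta = np.linalg.solve(R_t, Vt[:dim].T)
    return alpha, beta, s[:dim]

# Synthetic two-view data that share one latent variable z.
rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=80)
V = np.c_[z, z ** 2] + 0.05 * rng.normal(size=(80, 2))            # stand-in "visual" view
L = np.c_[np.sin(z), np.cos(z)] + 0.05 * rng.normal(size=(80, 2))  # stand-in "label" view

K_v = center_kernel(rbf_kernel(V, V))
K_t = center_kernel(rbf_kernel(L, L))
alpha, beta, s = kcca(K_v, K_t, kappa=0.1, dim=1)

emb_v = K_v @ alpha        # embedded visual features of the training samples
emb_t = K_t @ beta         # embedded label features of the training samples
corr = np.corrcoef(emb_v[:, 0], emb_t[:, 0])[0, 1]
```

The embedded training features are obtained by multiplying the kernel matrices with the learned weights; on this toy data the first pair of projections is strongly correlated because both views depend on the same latent variable.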

D. LABEL PROPAGATION BASED ON MULTI-LEVEL SEMANTIC NEIGHBORHOODS
The image annotations (labels) always cover semantic concepts at drastically different levels of abstraction, including image categories, scenes, abstract concepts, and concrete visual objects [9]. Currently, most nearest neighbor based methods measure the similarities between the test image and training images using only single-level visual or semantic features, which fail to provide rich representations at different abstraction scales and cannot depict multi-level semantic concepts. Consequently, many noisy images, whose content is irrelevant to the test image, are considered as neighbors and take part in neighbor-based label propagation. Since noisy images may worsen the annotation performance, it is necessary to get rid of them.
In contrast with traditional annotation models based on a single neighborhood selected directly from the whole training image dataset, we propose a novel annotation method based on multi-level semantic neighbors. Given an unlabeled test image, we propose a pre-annotation strategy that is applied before image annotation. The pre-annotation strategy assigns several tentative labels to the test image by propagating its visual neighbors' labels with a weighted KNN, whose weights are the visual similarities between the image and its neighbors. Similar to the label refinement method (Equations 2-5), we can assign 5 labels to the test image.
The only difference between the pre-annotation and the label refinement is that the pre-annotation step determines neighbors based on original visual features instead of textual label features. We use the pre-annotation labels as the test image's labels in the following process until the final annotation labels are predicted.
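A minimal sketch of the pre-annotation step is shown below. It assumes cosine similarity on the original visual features and a simple similarity-weighted vote over the neighbors' label vectors; the label prior is omitted for brevity, and the function name `pre_annotate` and its parameters are hypothetical.

```python
import numpy as np

def pre_annotate(test_feat, train_feats, train_labels, k=30, n_labels=5):
    """Weighted-KNN pre-annotation sketch.

    Returns the indices of the top n_labels labels and a label vector whose
    non-zero entries are the normalized scores of those labels.
    """
    test_feat = np.asarray(test_feat, dtype=float)
    train_feats = np.asarray(train_feats, dtype=float)
    train_labels = np.asarray(train_labels, dtype=float)

    # Assumed similarity: cosine between original visual feature vectors.
    norms = np.linalg.norm(train_feats, axis=1) * np.linalg.norm(test_feat) + 1e-12
    sims = train_feats @ test_feat / norms

    nbrs = np.argsort(-sims)[:k]                 # K visual nearest neighbors
    scores = sims[nbrs] @ train_labels[nbrs]     # similarity-weighted label votes
    top = np.argsort(-scores)[:n_labels]

    label_vec = np.zeros(train_labels.shape[1])
    label_vec[top] = scores[top] / max(scores[top].sum(), 1e-12)
    return top, label_vec

feats_demo = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels_demo = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1]])
top_demo, vec_demo = pre_annotate([1.0, 0.0], feats_demo, labels_demo, k=2, n_labels=2)
```

The returned label vector can then play the role of the test image's label feature in the subsequent passes.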
Our proposed method includes three KNN steps. First, our proposed model computes image-category similarities to determine the categories that are semantically relevant to the test image. The image-category similarity is the similarity between the label vector of the test image and the center of each category. If the image-category similarity is larger than the specified threshold, the category is considered relevant. Second, we compute the multi-modal (new visual feature and new label feature) image-image similarity to capture visually and semantically similar neighbors in each relevant category, and we combine the KNN images of all relevant categories into a single neighbourhood set of relevant images. Third, we assign the N most relevant labels to the test image based on the visual similarities between the test image and the relevant images.
In the first KNN, given a test image, our goal is to select the $K_1$ most relevant categories. The similarity between a test image ($I$) and a category ($m$) is defined between the label vector of $I$ and $C_m$, the center of the category $m$. We consider the $K_1$ most similar categories as relevant categories.
In the second KNN, our goal is to pick the $K_2$ most similar images from each relevant category and to combine them into a set of relevant images, which are visually and semantically similar to the test image. The multi-modal similarity between the test image ($I$) and a training image ($J$) is defined on their embedded visual and embedded label features.
In the third KNN, our goal is to find the $K_3$ most visually similar images and assign their labels to the test image. To focus on depicting local visual features, we measure image similarity with cosine similarity on the original visual feature rather than the embedded one. We choose $K_3$ neighbor training images in the original visual feature space for each current image $I_i$ and rank the labels for $I_i$ according to probability scores weighted by $sim_{vis}(I_i, I_j) = \cos(v_i, v_j)$, where $v_i$ and $v_j$ are original feature vectors, while $P(w \mid I_j)$ is a refined label feature. The probability scores of all labels for the image $I_i$, $P(w \mid I_i)$, can be predicted with Equation 4.
After a group of images has been automatically annotated, we regularize these probability scores. First, the probability scores for each image $I_i$ are regularized using row-normalization. Then, the probability scores for the group of images are regularized using column-normalization. Finally, the final annotations are selected with Equation (5).
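The three passes can be sketched as follows. The similarity functions here are assumptions made for illustration: cosine similarity is used for the image-category similarity, and the multi-modal similarity is taken as a convex combination of embedded visual and embedded label cosine similarities with a hypothetical weight `theta`. The row and column normalization of scores is omitted. All function and parameter names are hypothetical.

```python
import numpy as np

def cos_sim(query, rows):
    """Cosine similarity between one vector and each row of a matrix."""
    denom = np.linalg.norm(rows, axis=1) * np.linalg.norm(query) + 1e-12
    return rows @ query / denom

def three_pass_knn(q_lab, q_vis, q_emb_vis, q_emb_lab,
                   centers, category_of, labels, vis, emb_vis, emb_lab,
                   k1=3, k2=30, k3=30, theta=0.8, n_labels=5):
    """Sketch of three-pass KNN annotation for one pre-annotated test image."""
    # Pass 1: the K1 categories whose centers best match the test label vector.
    top_cats = np.argsort(-cos_sim(q_lab, centers))[:k1]

    # Pass 2: the K2 most similar images per relevant category, using an
    # assumed multi-modal similarity on the embedded features.
    mm = theta * cos_sim(q_emb_vis, emb_vis) + (1.0 - theta) * cos_sim(q_emb_lab, emb_lab)
    candidates = []
    for c in top_cats:
        members = np.flatnonzero(category_of == c)
        candidates.extend(members[np.argsort(-mm[members])[:k2]])
    candidates = np.array(candidates)

    # Pass 3: the K3 visually closest candidates (original features) vote
    # for labels, weighted by their cosine similarity to the test image.
    sv = cos_sim(q_vis, vis[candidates])
    order = np.argsort(-sv)[:k3]
    scores = sv[order] @ labels[candidates[order]]
    return np.argsort(-scores)[:n_labels], scores

# Toy data: two categories of three images each, four labels.
labels_tr = np.array([[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3, dtype=float)
vis_tr = np.array([[1, 0], [1, 0.05], [0.95, 0], [0, 1], [0.05, 1], [0, 0.9]])
cats_tr = np.array([0, 0, 0, 1, 1, 1])
centers_tr = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)

q_lab = np.array([1.0, 0.0, 0.0, 0.0])      # pre-annotated label vector
q_vis = np.array([1.0, 0.1])
pred, pred_scores = three_pass_knn(q_lab, q_vis, q_vis, q_lab,
                                   centers_tr, cats_tr, labels_tr,
                                   vis_tr, vis_tr, labels_tr,
                                   k1=1, k2=2, k3=2, n_labels=2)
```

In the toy example the test image is routed to the first category, so only the first two labels receive non-zero scores.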

IV. EXPERIMENT

A. DATASETS
We conducted our experiments on three benchmark datasets including Corel5k, ESP Game, and IAPR TC-12. The images in these datasets are of various categories such as natural scene, game, sketches, personal photos and so on, which makes the annotation a challenging task.
Corel5K is the first and also the most widely used dataset for evaluating image annotation. It was first used by Duygulu et al. in 2002, and since then it has become a de facto evaluation benchmark for comparing the annotation performance [2], [34]. It consists of 4500 training images and 499 test images. Each image is either 192 × 128 or 128 × 192 pixels. Each image is annotated with up to 5 words (labels), with 3.5 labels on average from a dictionary of 260 labels.
The ESP Game dataset was published by von Ahn and Dabbish in 2004. The dataset consists of 18689 training images and 2081 test images. Each image is manually annotated with up to 15 labels, with 4.7 labels on average from a dictionary of 268 labels. The images are annotated by players of an online game. Two players who are unknown to each other must predict the same keyword(s) for a randomly given image in order to score points, which makes this dataset quite challenging and diverse.
The IAPR TC-12 dataset was introduced by Grubinger for cross-lingual information retrieval in 2007. Each image is initially associated with a long description. The English nouns extracted from the descriptions by Makadia [4], [12] are treated as annotations. The dataset consists of 17665 training images and 1962 test images. Each image is 480 × 360 or 360 × 480 pixels. Each image is manually annotated with up to 23 labels, with 5.7 labels on average from a dictionary of 291 labels. The dataset has been widely used for evaluating image annotation models.
Well-known large-scale datasets include NUS-WIDE [35] and Microsoft COCO (MS-COCO). There are many works on NUS-WIDE [29], [31], and all of them remove some noisy tags and images to obtain a clean image dataset. However, the resulting refined datasets differ from one work to another. Therefore, among the large-scale datasets, we conduct experiments only on MS-COCO. The MS-COCO dataset is used for image recognition, segmentation, and captioning. It contains 123 thousand images with 170,339 user-provided noisy tags and 80 expert-provided ground truth labels. Following previous works [12], [36], we keep only the 1,000 most frequent tags and remove the images without any expert label, which leaves us with 123,286 images, including 82,782 training images and 40,504 test images; each image is annotated with 2.9 labels on average. The refined MS-COCO dataset is the same as the one used in previous research works [12], [36].

B. EVALUATION METRICS
The per-label evaluation metrics have been widely used to evaluate image annotation approaches over the past two decades, and today they are considered the standard evaluation metrics. They include precision, recall, and F1-measure. For each label, per-label precision is defined as the number of images correctly predicted over the total number of images predicted with this label, and per-label recall is defined as the number of images correctly predicted over the total number of images having this label in their ground-truth or manual annotations. These values are averaged over all the labels in the vocabulary to get the average (percentage) per-label precision ($P_L$) and the average per-label recall ($R_L$), respectively. From these scores, we compute the average per-label F1-measure ($F1_L$), which is the harmonic mean of $P_L$ and $R_L$. The per-label precision is defined as

$$P = \frac{TP}{TP + FP}$$

where TP is the number of images that contain the label in their manual annotations and for which the annotation model correctly predicts the label, and FP is the number of images that do not contain the label and for which the model incorrectly predicts it. TP + FP equals the total number of images for which the model predicts the label. The per-label recall is defined as

$$R = \frac{TP}{TP + FN}$$

where FN is the number of images that contain the label in their manual annotations and for which the model does not predict it. TP + FN equals the total number of images containing the label in the manual annotations. The F1-measure combines precision and recall into a single score for comprehensive performance evaluation. The per-label F1-measure is defined as

$$F1_L = \frac{2 \cdot P_L \cdot R_L}{P_L + R_L}$$

We also consider the N+ metric, which counts how many labels in the vocabulary are correctly predicted for at least one test image.
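The per-label metrics can be computed from binary ground-truth and prediction matrices as in the following sketch. Treating a label with no predictions (or no ground-truth images) as having precision (or recall) 0 is an assumption of this sketch, and the function name `per_label_metrics` is hypothetical.

```python
import numpy as np

def per_label_metrics(y_true, y_pred):
    """Average per-label precision, recall, F1, and N+.

    y_true, y_pred: binary matrices of shape (n_images, n_labels).
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    tp = (y_true & y_pred).sum(axis=0)       # per label: correctly predicted images
    fp = (~y_true & y_pred).sum(axis=0)      # predicted but not in ground truth
    fn = (y_true & ~y_pred).sum(axis=0)      # in ground truth but not predicted

    n_labels = y_true.shape[1]
    # Assumption: undefined ratios (0/0) count as 0.
    prec = np.divide(tp, tp + fp, out=np.zeros(n_labels), where=(tp + fp) > 0)
    rec = np.divide(tp, tp + fn, out=np.zeros(n_labels), where=(tp + fn) > 0)

    p_l, r_l = prec.mean(), rec.mean()
    f1_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) > 0 else 0.0
    n_plus = int((tp > 0).sum())             # labels predicted correctly at least once
    return p_l, r_l, f1_l, n_plus

truth_demo = [[1, 0, 1], [1, 1, 0], [0, 1, 0]]
pred_demo = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
P_L, R_L, F1_L, N_plus = per_label_metrics(truth_demo, pred_demo)
```

Note that F1 here is the harmonic mean of the averaged precision and recall, following the definition above, rather than the average of per-label F1 values.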
Besides per-label metrics, more and more researchers adopt per-image metrics (also including precision, recall, and F1-measure) to evaluate annotation performance [9]- [11], [20], [30], [37], the per-label metrics are biased toward infrequent labels because making them correct could have a very significant impact on final accuracy [10]. These values are averaged over all the images in the test dataset to get average (percentage) per-image precision (P I ) and average per-image recall (R I ), respectively. The per-image precision is defined as, where TP is the number of the labels that are contained in the image and are correctly predicted the label by annotation model. FP is the number of the labels that are not contained in the image and are incorrectly predicted the label. TP + FP equals to the total number of labels that are predicted by the model.The per-image recall is defined as, where FN is the number of the labels that are contained in the image and are not predicted the label by the model. TP + FN equals to the total number of the labels that is contained in the image. F1-measure combines P I with R I , indicating the integrated result. The per-image F1-measure (F1 I ), the harmonic mean of P I and R I . The per-image F1-measure is defined as, The mean average precision (MAP) is a widely used metric in the field of image retrieval [11], [12], [38]. The MAP includes per-label MAP (MAP L ) and per-image MAP (MAP I ), which take into account all labels for every image, and evaluate the full ranking. MAP L measures image-ranking quality corresponding to labels, whiles MAP I measures labelranking quality corresponding to images. MAP measures the full ranking of images instead of only the top labels for each image as traditional evaluation metrics [11]. Therefore, MAP L is less noisy and preferable to other per-label metrics.
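As a sketch of how the two MAP variants differ, the code below computes average precision over a full ranking and averages it either over images (ranking labels) or over labels (ranking images). Skipping rows without any relevant item is an assumption of this sketch, and the function names are hypothetical.

```python
import numpy as np

def average_precision(relevant, scores):
    """Average precision of one ranking; None if nothing is relevant."""
    relevant = np.asarray(relevant, dtype=bool)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    rel = relevant[order]
    if not rel.any():
        return None
    hits = np.cumsum(rel)                        # relevant items seen so far
    ranks = np.arange(1, rel.size + 1)
    return float(np.mean(hits[rel] / ranks[rel]))  # precision at each relevant rank

def mean_average_precision(y_true, scores, per="image"):
    """MAP over images (rank labels) or over labels (rank images)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    if per == "label":
        y_true, scores = y_true.T, scores.T      # one ranking of images per label
    aps = [average_precision(t, s) for t, s in zip(y_true, scores)]
    aps = [ap for ap in aps if ap is not None]   # assumption: skip empty rows
    return float(np.mean(aps)) if aps else 0.0

truth_map = [[1, 0, 1], [0, 1, 0]]
score_map = [[0.9, 0.8, 0.3], [0.2, 0.7, 0.1]]
map_i = mean_average_precision(truth_map, score_map, per="image")
map_l = mean_average_precision(truth_map, score_map, per="label")
```

The same score matrix therefore yields two different numbers, depending on whether labels are ranked within each image or images are ranked within each label.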
Recently, a growing number of works have used MAP as an evaluation metric for image annotation [12], [39]-[41]. To evaluate annotation performance more comprehensively, we also use $MAP_L$ and $MAP_I$ as supplementary evaluation metrics.
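To make the distinction between the two MAP variants concrete, the following sketch computes both from a matrix of real-valued label scores. It uses the common retrieval definition of average precision (the mean of precision at each relevant rank); the function names, the `per` argument, and the choice to skip labels or images with no relevant item are assumptions for illustration and may differ from the protocol used in the cited works.

```python
import numpy as np

def average_precision(relevance_sorted):
    """AP of one ranking; relevance_sorted is 0/1 in ranked order."""
    rel = np.asarray(relevance_sorted, dtype=float)
    if rel.sum() == 0:
        return None  # no relevant item: AP is undefined
    hits = np.cumsum(rel)
    ranks = np.arange(1, len(rel) + 1)
    # Precision at each rank, kept only where the item is relevant.
    return float(((hits / ranks) * rel).sum() / rel.sum())

def mean_average_precision(y_true, scores, per="label"):
    """per="label": rank images for each label (MAP_L).
    per="image": rank labels for each image (MAP_I).
    y_true, scores: arrays of shape (n_images, n_labels)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    if per == "image":
        y_true, scores = y_true.T, scores.T
    aps = []
    for j in range(y_true.shape[1]):
        order = np.argsort(-scores[:, j], kind="stable")
        ap = average_precision(y_true[order, j])
        if ap is not None:  # assumption: skip undefined rankings
            aps.append(ap)
    return float(np.mean(aps)) if aps else 0.0

if __name__ == "__main__":
    y_true = [[1, 0], [0, 1], [1, 1]]
    scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
    print(mean_average_precision(y_true, scores, per="label"))
    print(mean_average_precision(y_true, scores, per="image"))
```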

C. IMPLEMENTATION DETAILS
For a fair comparison, the visual features for all methods except MBRM are extracted with the same deep network architecture (VGG-16); MBRM uses handcrafted features, as required by the model itself. For the PLSA method, we first extract convolutional features from Conv5_2 of VGG-16 and then generate a 1000-dimensional visual feature vector for each image with the k-means algorithm. For the other methods, we use the 4096-dimensional output of the FC7 layer of VGG-16 as the visual feature vector. The VGG-16 network used in this paper is pre-trained on the ImageNet2012 dataset [22] and is not retrained or fine-tuned on the target datasets, in order to demonstrate the generality of our model.
For nearest-neighbor based models, the number of nearest neighbors K is set to the optimum value for each model: 15 for JEC, TagProp, and our method, and 3 for 2PKNN. K in Equation 3 is set to 100, and θ in Equation 13 is set to 0.8. The neighbor numbers K for the three passes of our three-pass KNN are set to 3, 30, and 30, respectively.

D. RESULTS AND COMPARISON
For a fair comparison, we carry out our experiments on the same three benchmark datasets (Corel5k, ESP Game, and IAPR TC-12) and predict a fixed number of annotations (five labels) for each test image. We compare our method with representative methods using per-label metrics (precision, recall, and F1-measure), per-image metrics, and MAP. Furthermore, we use the hybrid F1-measure (called H-F1), which combines the per-label F1-measure and the per-image F1-measure through their harmonic mean [9]. We compare our method with state-of-the-art models, including the classical probabilistic model MBRM, the classical topic model PLSA-WORDS, the classical CCA model CCA-KNN, two nearest-neighbor models (JEC and TagProp), the classical discriminative model SVM-DMBRM [14], and the state-of-the-art nearest-neighbor based model 2PKNN. We also compare with the GAN-based annotation method D²IA. Only some of the metrics are compared for CCA-KNN, SVM-DMBRM, and D²IA, whose results are quoted from [7], [14], [24].
The experimental results on Corel5k, ESP Game, and IAPR TC-12 are summarized in Tables 1, 2, and 3, respectively. From Tables 1-3, we can see that our proposed method significantly outperforms all methods except D²IA on the three benchmark datasets in terms of almost all metrics. This improvement is largely due to label refinement and the multi-level semantic neighborhoods.
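The fixed-length prediction and the H-F1 score described above can be illustrated as follows. This is a sketch under stated assumptions: `top_n_labels` and `hybrid_f1` are hypothetical helper names, the score matrix is assumed to hold one relevance score per image and label, and ties are broken by label index.

```python
import numpy as np

def top_n_labels(scores, n=5):
    """Binary prediction matrix keeping the n highest-scoring labels
    per image. scores: array of shape (n_images, n_labels)."""
    scores = np.asarray(scores, dtype=float)
    # Stable sort: ties are broken by label index (an assumption).
    top = np.argsort(-scores, axis=1, kind="stable")[:, :n]
    y_pred = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(y_pred, top, True, axis=1)
    return y_pred

def hybrid_f1(f1_label, f1_image):
    """H-F1: harmonic mean of per-label F1 and per-image F1."""
    total = f1_label + f1_image
    return 2 * f1_label * f1_image / total if total > 0 else 0.0

if __name__ == "__main__":
    print(top_n_labels([[0.1, 0.9, 0.5]], n=2))
    print(hybrid_f1(0.5, 5.0 / 9.0))
```

Because the harmonic mean is dominated by the smaller of its two arguments, a method scores well on H-F1 only if it does well on both the per-label and the per-image F1-measure.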
To further evaluate the annotation performance, we varied the number of annotation labels from 2 to 20 and compared our method with competitive methods. The per-label and per-image precision-recall curves of MBRM, JEC, TagProp, PLSA-WORDS, 2PKNN, and our method on the three benchmark datasets are shown in Figure 2. The per-image and per-label precision and recall values are means calculated over all test images and all labels, respectively. As can be seen from Figure 2, our model remarkably outperforms the others for almost any number of annotation labels. This again confirms the effectiveness of our method.
To compare with deep learning based image annotation models on large-scale datasets, we also carry out experiments on the MS-COCO dataset and predict a fixed number of annotations (three labels) for each test image. We compare our method with state-of-the-art models, including CNN+WARP, CNN-RNN, and MangoNet. We annotate images based on multi-modal (deep visual features and textual tag features) embedding features mapped by KCCA. The experimental results on MS-COCO are summarized in Table 4. As can be seen from Table 4, our method significantly outperforms the other methods (non-deep as well as deep learning based) on this large-scale dataset in terms of most evaluation metrics, which is mainly due to high-level semantic features and accurate neighbors. MangoNet outperforms our method in terms of the precision or recall metrics, which might largely be due to its co-attention and GCN model: it can capture high-quality visual features and accurately model the correlations between each target image and its neighbors through a metadata neighborhood graph. As far as the most important metric, per-label F1, and the comprehensive metric H-F1 are concerned, our method generally outperforms MangoNet.
The main reasons for the performance improvement of our proposed model can be summarized as follows. 1) We propose label refinement to alleviate weak-labeling. 2) We address the issues of the semantic gap and different levels of abstraction with our proposed multi-level semantic neighborhoods. 3) Our method outperforms most methods in terms of per-label metrics by a large margin, which is mainly attributable to our handling of label-imbalance. In contrast to traditional NN models, which pay more attention to frequent labels, our method focuses on images of the same category and on relevant images rather than on rare labels, and thus gives equal importance to all labels of the relevant images. As a consequence, our method can improve the annotation performance for infrequent labels without sacrificing frequent labels, thus improving performance in both per-label and per-image metrics.
Table 5 shows several examples of annotations produced by JEC, 2PKNN, and our method on the three datasets. The example images in the first three rows are from Corel5k, those in the second three rows are from ESP Game, those in the third three rows are from IAPR TC-12, and the last two images are from ESP Game and IAPR TC-12. For most images, our method correctly predicts the ground-truth annotations, although it also produces some extra labels. On inspecting the extra labels (shown in blue), we find that most of them are consistent with the content of the images but are not included in the ground-truth labels. Our method considers category-level information, semantic information, and visual information in different steps so as to find visually and semantically similar images and predict the correct annotation labels.
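The sweep over the number of annotation labels can be sketched as below for the per-image case; a per-label curve can be obtained analogously by averaging over labels instead of images. The function name `per_image_pr_curve`, the score-matrix input, and the clipping of n to the vocabulary size are assumptions for this illustration, and no plotting is included.

```python
import numpy as np

def per_image_pr_curve(y_true, scores, n_values=range(2, 21)):
    """Mean per-image precision and recall when the top-n labels are
    predicted, for each n in n_values.

    y_true: binary array (n_images, n_labels); scores: same shape.
    Returns a list of (n, precision, recall) tuples.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, axis=1, kind="stable")
    ranked_true = np.take_along_axis(y_true, order, axis=1)
    # hits[i, k-1] = number of correct labels among the top k of image i.
    hits = np.cumsum(ranked_true, axis=1)
    n_true = y_true.sum(axis=1)

    curve = []
    for n in n_values:
        k = min(n, y_true.shape[1])  # cannot predict more labels than exist
        tp = hits[:, k - 1].astype(float)
        precision = float((tp / k).mean())
        recall = float(np.divide(tp, n_true, out=np.zeros_like(tp),
                                 where=n_true > 0).mean())
        curve.append((n, precision, recall))
    return curve

if __name__ == "__main__":
    y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
    scores = [[0.9, 0.8, 0.7, 0.1], [0.1, 0.9, 0.2, 0.3]]
    for point in per_image_pr_curve(y_true, scores, n_values=[1, 2, 3]):
        print(point)
```

As more labels are predicted per image, recall can only stay the same or increase, while precision typically falls, which is what gives the precision-recall curve its shape.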

E. QUALITATIVE ANALYSIS
For the tenth image, no method correctly predicts all of the ground-truth annotations. Both JEC and 2PKNN improperly predict ''white'', while our proposed method improperly predicts ''mountain'', ''hill'', and ''lake''. For JEC and 2PKNN, the relevance between the test image and the training images depends entirely on their visual similarity. Hence, they predict ''white'' from the visual features of the sky. Our proposed method first identifies the image as belonging to the scene category and then predicts ''mountain'', ''hill'', and ''lake''. For the eleventh image, only our method improperly predicts ''skirt'' rather than ''wall'', while the others correctly predict the ground-truth annotations. This is possibly because our method tends to focus on the foreground object rather than the background.

F. EFFICIENCY ANALYSIS
To verify the efficiency of the proposed model, we compare the time costs of MBRM, JEC, TagProp, 2PKNN, PLSA, and our model in Table 6. The experiments are mainly performed in Matlab on a computer with a 2.6 GHz Intel Core i7-9750H CPU and 16 GB of RAM running Windows 10, although some components of TagProp are implemented in C. As shown in Table 6, the time costs of two models (TagProp and PLSA) can be separated into a training stage and a testing stage, whereas those of JEC, LL-PLSA, and 2PKNN cannot.
In contrast to other nearest-neighbor models, the time overhead of our model is proportional to the number and size of the categories rather than to the size of the entire training image database. In most cases, the number of categories is small and constant; therefore, our time cost is much smaller than that of the others. All in all, compared with other nearest-neighbor models, our proposed model is better suited to real-world online image repositories and large-scale social image databases.

V. CONCLUSION AND FUTURE WORK
We present a novel image annotation method based on multi-level semantic neighborhoods. Our proposed method has several advantages. 1) To our knowledge, this is the first published work that proposes a pre-annotation strategy to determine the category of the test image in order to raise the semantic level. 2) Our proposed method refines labels before image annotation to alleviate the issue of weak-labeling. 3) Our proposed method is based on multi-level semantic neighborhoods, which provide rich representations at different abstraction scales. As a consequence, the model is suitable for the image annotation task because it can address the issue of wider range labels. 4) Our proposed method is a three-pass variant of the traditional KNN, with each pass using a different feature vector. It can find visually and semantically similar neighbor images, which reduces the semantic gap and improves performance. 5) In contrast to traditional NN models, which pay more attention to frequent labels, and the classical 2PKNN, which pays more attention to rare labels, our method improves performance in both per-label and per-image metrics.
Extensive experiments demonstrate that our method significantly outperforms competitive methods in terms of almost all evaluation metrics. Even though nearest-neighbor based annotation models are conceptually clear, structurally intuitive, and effective, they have several shortcomings. First, these methods are time-consuming and space-consuming if the training image dataset is huge. Second, the performance of nearest-neighbor model-based AIA methods may be influenced by the size of the training dataset.
In the future, we will explore a new modeling strategy that builds on the existing model and combines the merits of discriminative and generative models so as to further reduce modeling complexity. In addition, we are interested in applying attention models to feature extraction and graph neural networks to representation learning of multi-modal information.