A Scientometric Visualization Analysis of Image Captioning Research from 2010 to 2020

Image captioning has gradually gained attention in the field of artificial intelligence and has become an interesting and challenging task in image understanding. It requires identifying the important objects in an image, extracting their attributes, inferring the relationships between them, and generating human-like descriptions. Recent work on deep neural networks has greatly improved the performance of image captioning models. However, machines are still unable to imitate the way humans think, talk, and communicate, so image captioning remains an open task. It is therefore important to keep up with the latest research and results in the field, yet publications on this topic are numerous. Our work aims to give researchers a macro-level understanding of image captioning from four aspects: spatial-temporal distribution characteristics, collaborative networks, trends in subject research, and the historical evolutionary path. We employ scientometric visualization methods to achieve this goal. The results show that China has published the largest number of publications on image captioning, but the United States has the greatest impact on research in this area. In addition, thirteen academic groups are identified in the field of image description, with institutions such as Microsoft, Google, the Australian National University, and the Georgia Institute of Technology being the most prominent research institutions. Meanwhile, we find that evaluation methods, datasets, novel image captioning models based on generative adversarial networks, reinforcement learning, and the Transformer, as well as remote sensing image captioning, are the new research trends. Lastly, we conclude that image captioning research went through three major development stages from 2010 to 2020, and on this basis we propose a more comprehensive taxonomy of image captioning.


I. INTRODUCTION
As the representative technology of artificial intelligence (AI), deep learning has developed rapidly in recent years and has been widely used throughout the fields of computer vision (CV) and natural language processing (NLP). Image captioning (or image caption generation) is an important part of image understanding; it automatically generates human-like sentences for a given image [1]. The task requires the machine to recognize the objects in the image, understand the relationships between them, and express the main information in concise natural language descriptions. Image captioning has been applied extensively in social media [2], remote sensing [3], robotics [4], and medical image report generation [5]. It helps machines "see" the content of pictures, promotes machine intelligence, and will assist machines to "think", "talk", and "behave" like humans in the future.
In the language-vision community, image captioning has emerged as a popular research area that combines image understanding with turning the image information into a natural language description [6][7][8]. From a CV point of view, image captioning is a more challenging task than image recognition and image classification, owing to the extra difficulty of recognizing the objects and actions within the image and producing a meaningful description based on the contents found. When the computer encounters an image and outputs the corresponding visual context, it may describe features of the image (e.g., shape, color, and texture), it can present the primary objects of the scene, and it may even predict a dynamic relationship between people and objects (e.g., a man is playing a frisbee game with a puppy). Furthermore, an image description can mention objects that do not appear and tell a story beyond the visual content (e.g., "dad drives his daughter to buy a gift for her mother", even though the picture only shows the father driving a car with his daughter and contains neither the mother nor the gift store). In short, image captioning technology requires not only accurate recognition of the objects in the image but also contextual and background knowledge to understand the intent the image expresses.
Image description research has made a number of breakthroughs in the last decade. Vinyals et al. [9] extracted features from images and fed them into a recurrent neural network (RNN) together with manually annotated sentences to obtain image content descriptions. Xu et al. [10] combined long short-term memory (LSTM) models with attention mechanisms inspired by human vision to focus on salient objects when generating the corresponding words. Shi et al. [11] proposed a framework for remote sensing image description using convolutional neural networks (CNNs). Wu et al. [12] proposed incorporating high-level concepts into the CNN-RNN approach and achieved a significant improvement in image captioning. Lu et al. [13] introduced a visual "sentinel" strategy and designed an adaptive visual attention model. Qu et al. [14] proposed a deep multimodal neural network model for the semantic understanding of high-resolution remote sensing images. Lu et al. [15] constructed a large-scale aerial image dataset for the remote sensing image captioning problem. Yang et al. [16] proposed a multitask learning algorithm for cross-domain image captioning, which simultaneously optimizes the two coupled objectives of image captioning and text-to-image synthesis through a dual learning mechanism to improve captioning performance. Deep neural networks have thus done much to address the challenges of the image captioning field. To date, researchers have presented a number of review papers [17][18][19][20][21] summarizing the development of image description techniques.
Although these survey articles provide a good literature review of image captioning, they can only cover a portion of the papers on visual captioning, because researchers are generally unable to survey the complete body of publications. Moreover, these review papers tend to focus on models, datasets, and evaluation methods, neglecting the spatial-temporal distribution characteristics, research hotspots, and research communities that have shaped the development of the field. In order to grasp the development direction of image captioning technology from a macro perspective, and to help researchers gain a comprehensive understanding of the state of the field, we propose a review method for image captioning based on scientometric analysis.
In this paper, we use scientometric (or bibliometric) analysis methods to establish a systematic review of image captioning research. Unlike traditional interpretive reviews, scientometric methods [22,23] are data-driven approaches based on bibliographic data, and they provide overall knowledge of the research fields that scholars are interested in. These methods often employ particular metrics, e.g., co-word analysis [24], co-authorship analysis [25], and co-citation analysis [26], to visualize the literature.
The remainder of this paper is organized into four parts. Section II describes how the bibliographic data were collected and which methods we use for the analysis. Section III presents the results of the bibliographic analysis. Section IV discusses the results obtained through the data-driven bibliographic analysis and presents a taxonomy of image captioning approaches. Section V gives a summary that briefly answers the research questions Q1-Q3.

II. DATA AND METHODOLOGY
In this research, we provide an overall scientometric analysis framework for image captioning, comprising two stages: data preparation and bibliometric data analysis. As shown in Fig. 1, we first collected data by setting specific conditions, and then employed basic metrics, core research community mining, key topic and reference identification, and the evolutionary path of image captioning to answer Q1-Q3. The main bibliographic methods, such as co-authorship analysis, co-occurrence analysis, and co-citation analysis, are employed for the scientometric study of image captioning.

A. DATA PREPARATION
There are several bibliographic data sources that can be used for scientometric analysis. These include abstract and citation index databases such as Web of Science (WOS) and Scopus; full-text databases such as ScienceDirect, SpringerLink, and ProQuest; free online databases such as Google Scholar, Microsoft Academic, Dimensions, and PubMed; and other sources such as the Derwent Innovations Index for patents and the Book Citation Index. This paper uses the WOS to collect bibliographic data for the scientometric analysis of image captioning research, because the WOS contains widely accepted indices such as the Science Citation Index Expanded (SCIE) and the Social Science Citation Index (SSCI). We set the search conditions by restricting the database to the SCIE and SSCI indices. Considering that many high-quality papers appear in computer science conferences, we also included the Conference Proceedings Citation Index-Science (CPCI-S). We confined the time span to 2010-2020, set the article type to "Article" and the language to "English", and conducted a topic search using the terms "image captioning" or "image caption". The full conditions for collecting the bibliographic data are presented in Table 1. In total, 697 papers were returned by these settings, including 235 journal articles and 462 proceedings articles. Unlike in other disciplines, image captioning results are mostly published at conferences related to computer vision and natural language processing. The majority of these papers were published after 2014, which indicates that image captioning is an emerging research area.
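For reference, the retrieval strategy can also be expressed as a single WOS advanced-search query. The following is an approximate reconstruction from the conditions above and in Table 1, not a verbatim copy of the query interface settings:

TS=("image captioning" OR "image caption") AND PY=(2010-2020) AND LA=(English) AND DT=(Article)
Indexes: SCI-EXPANDED, SSCI, CPCI-S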

B. METHODS AND TOOLS
Several scientometric methods and software tools were employed to illustrate or visualize the research situation and progress of image captioning. Before answering Q1-Q3, it is helpful to form an overall impression of image captioning research.

1) BASIC STATISTICS
Basic statistics of a scientometric analysis, e.g., the yearly publication output, core journals/conferences, countries, and institutions reported in this paper, provide "commonsense knowledge" of a research domain. In addition, we used the Total Local Citation Score (TLCS) and the Total Global Citation Score (TGCS) to indicate the influence of a publication or organization. The TLCS is the citation count of a journal/institution/country within the 697 collected papers; it reflects standing within the specific research domain. The TGCS is the citation count of a journal/institution/country within all papers in the WOS; it reflects global impact. The software used in this analysis was HistCite [27] and VOSviewer [28].
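To make the two scores concrete, the following minimal Python sketch tallies TLCS and TGCS over a locally exported record set; the field names ('id', 'cited_refs', 'times_cited') are illustrative assumptions, not the actual HistCite data model.

# records: list of dicts with keys 'id' (paper identifier),
#   'cited_refs' (ids of the papers it cites), and 'times_cited'
#   (the paper's global WOS citation count).
def compute_scores(records):
    local_ids = {r["id"] for r in records}
    tlcs = {r["id"]: 0 for r in records}
    for r in records:
        for ref in r["cited_refs"]:
            if ref in local_ids:        # citation stays inside the 697-paper set
                tlcs[ref] += 1
    tgcs = {r["id"]: r["times_cited"] for r in records}   # global impact
    return tlcs, tgcs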

2) ACADEMIC COMMUNITY MINING
An overall picture of a research domain may not be enough for a scholar, because academic activity involves collaboration between individual scholars as well as institutions. Knowing the academic communities is necessary for a researcher to develop his/her career. Fortunately, co-author analysis is a powerful method for exploring academic communities: a co-authorship link exists between two scholars when they have co-authored one or more papers. VOSviewer can execute this method and display co-author networks, and thus reveal academic communities.
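VOSviewer performs this computation internally; as a transparent illustration, the Python sketch below builds a weighted co-author graph and extracts communities with a standard modularity method, which stands in for VOSviewer's own clustering algorithm. The input format is an assumed simplification.

import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def coauthor_communities(papers):
    # papers: one list of author names per publication
    g = nx.Graph()
    for authors in papers:
        # each unordered author pair in a paper adds one unit of link strength
        for a, b in itertools.combinations(set(authors), 2):
            w = g.get_edge_data(a, b, {"weight": 0})["weight"]
            g.add_edge(a, b, weight=w + 1)
    return list(greedy_modularity_communities(g, weight="weight"))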

3) VISUALIZATION METHODS OF HOT TOPICS, RESEARCH TRENDS AND EVOLUTIONARY PATH
It is often difficult for a beginner to produce a traditional literature review, because discussions of hot topics, and especially of research trends, are usually based on the experience of a scholar who has worked in the field for years. Scientometric analysis provides a tool for this job: co-word, or co-occurrence, analysis. A co-word relation exists when two words appear in the same paper, abstract, or keyword list [29]. Co-word analysis is often used to discover hot topics and research trends [30]. In addition, the document co-citation network can reveal the research front and knowledge structure in a scientific way. VOSviewer, CiteSpace [31], and HistCite are employed for the visualization.

III. RESULTS
A. BASIC BIBLIOMETRIC ANALYSIS
Bibliometrics is the quantitative analysis of scholarly publications with the aim of demonstrating their impact on academic fields; it can estimate how much influence a given research article has on later research. In this section, we establish the basic statistical analysis of image captioning research in order to obtain the yearly output, the sources, the spatial distribution, and the main research institutions of image caption publications.

1) YEARLY OUTPUT OF IMAGE CAPTION RESEARCH
The annual number of publications has grown rapidly, with the majority of papers appearing after 2014, so we can conclude that "image captioning" research is an emerging trend. The statistics also show that the TLCS and TGCS were largest in 2017: literature citations lag publication, and researchers published a total of 119 articles in 2017, more than in all previous years of the study period combined.

2) THE SOURCES OF IMAGE CAPTION RESEARCH
From the analysis with the HistCite software, the important conference and journal papers in the field of image captioning are shown in Table 2. Quantitative bibliometric analysis provides an objective description of the development of a field. Researchers in many disciplines may argue that journal papers have a greater influence on disciplinary development, but the trajectory of the last decade in image captioning shows that many important research findings have been published at top computer vision and artificial intelligence conferences such as CVPR, ICCV, and AAAI. As can be seen from Table 2, conference papers exceeded journal papers in both volume and impact on the image captioning field from 2010 to 2020.
From a temporal perspective, the majority of journal papers on image captioning were published after 2017. By comparison, researchers published 15 conference papers as early as 2015, and in each of the following five years the number of conference papers grew much faster than the number of journal papers. This means that, with the development of image captioning, more and more researchers are joining the field. Since many of the papers published at international conferences such as CVPR, ICCV, and ECCV are openly accessible, researchers can more easily conduct innovative research or extend existing work. Furthermore, due to the timeliness and innovation requirements of computer vision conferences, some researchers choose to publish follow-up work in relevant journals.
Although the results of image captioning research are mostly submitted to international conferences, many scholars also prefer to publish their findings in journals, mainly for the following reasons. First, journals usually have longer page limits: if a paper has too many experimental results to fit in a conference publication, a journal affords an opportunity for inclusion, and review papers are usually published in journals because of their length. Second, despite the longer review cycle, journal reviews may be more detailed. Third, researchers often prefer journals because of time constraints, personal preference, university requirements, or practical needs.

3) SPATIAL DISTRIBUTION OF IMAGE CAPTION RESEARCH
Table 3 shows the distribution of the total number of publications and citations by country. HistCite was run with the pre-defined parameters to obtain the publication and citation counts for the top 10 countries and to generate the country distribution table for image captioning studies. China and the USA produced most of the articles and citations on image captioning, together accounting for more than 75% of the total publications. Although China has the highest number of publications in this field, with 338 articles, papers published in the United States are more influential, with 6215 global citations. This may be because much of the groundwork in artificial intelligence has been done in the United States. In addition, the publication counts in Table 3 show that Asian countries are particularly active in this field.

4) MAIN RESEARCH INSTITUTIONS
In order to identify the core institutions in the image caption generation field, we analyzed the institutional knowledge graphs with the HistCite and VOSviewer software; the most dynamic institutions are often considered the cutting-edge leaders of the field. We collected 20 institutions that focus on image captioning and have published their research results at conferences or in journals. In HistCite, an article with a high TGCS has received wide attention from scientists around the world; however, if an article has a high TGCS but a small TLCS, this attention comes mainly from scientists in other fields rather than from image captioning. For beginners in the domain of image description generation, the TLCS is more important, as it helps researchers quickly locate the classical literature of the field. As shown in Table 4, papers published by research institutions such as Google, the Australian National University, and Microsoft are classics of this field and worthy of in-depth study by beginners. In addition, in terms of citations in the field of image description generation, prestigious universities in the traditional engineering domains (e.g., MIT, CMU, GIT, and Stanford University) and the American Internet giants (e.g., Microsoft, Google, and Facebook) dominate the highly cited papers. By analyzing the 697 publications with VOSviewer, we derive Fig. 3. The size of a label represents the number of papers or citations. The color bar in the bottom right corner of the visualization, spanning 2016 to 2019, indicates the period over which an institution's papers have been cited. Echoing the findings in Table 4, Fig. 3(a) illustrates that Chinese research institutions, represented by the Chinese Academy of Sciences system, have invested a great deal of effort in image captioning and published numerous research papers.
Fig. 3(b) presents the overlay visualization of the citation-based map of image captioning research institutions. Take Google as an example: [9], published in 2015 and the most highly cited paper in the field, has a TLCS of 302. This article is considered an early pioneer of the image caption task, and it achieved strong results with an ingenious modification of the Encoder-Decoder structure.

B. CORE RESEARCH COMMUNITIES MINING
The discovery of implicit research communities in scientists' collaboration networks is of great importance for understanding the collaboration and communication patterns of researchers. To study this issue in depth, we use VOSviewer to find "co-occurrence clusters". The underlying idea is that two things appearing at the same time are related, and such relationships come in various types, e.g., co-authorship and word co-occurrence. In this section, we use VOSviewer to find different types of groups based on clustering of relationship strength and direction measures. Fig. 4 shows a clear pattern of collaboration among researchers in the image caption domain. We selected the main research groups, and the most-linked collaborators in each group, to analyze the characteristics of the image caption communities.

1) MICROSOFT-CENTRIC COMMUNITIES
In the 2015 MS COCO Image Captioning Challenge, Microsoft and Google tied for first place, their two separate systems performing equally well. In this competition, winners were determined by two main metrics: the percentage of captions judged equal to or better than human-written captions, and the percentage of captions that pass a Turing test. The challenge, based on a dataset provided by Microsoft [32], raised the popularity of image captioning and drew many research institutions into the study. This makes Microsoft one of the leaders in the field of image description. As shown in Fig. 5, since Microsoft Research has published a considerable number of highly cited papers on image captioning, two of the communities resulting from our analysis (#1 and #4) are Microsoft-centric.
Cluster #1 is the largest research community and includes influential articles in the image captioning field; many of its researchers come from Microsoft Research. As shown in Fig. 5(a), the size of a node represents the degree of co-authorship: larger nodes indicate authors with more connections than those represented by smaller nodes. Many of their articles have pioneered new research directions, such as semantic compositional networks [33], bottom-up and top-down attention [34], and StyleNet [35]. Cluster #4 in Fig. 5 is the other Microsoft-centric community. Z. Gan et al. focused on the semantic concept problem in image captioning, proposing a method that integrates semantic information with the parameters of a recurrent neural network. To generate attractive captions for images in different styles, C. Gan et al. proposed a novel framework named StyleNet; their paper was the first to investigate generating attractive, stylized image captions without supervised style-specific image-caption paired data.
H. Fang et al. [37] proposed a novel approach for automatically generating image descriptions. Their methodology has three main components: visual detectors, a language model, and a multimodal similarity model. The system was trained on images and their corresponding captions and learned to extract nouns, verbs, and adjectives from images. In another article from Microsoft Research, J. Devlin et al. [38] achieved state-of-the-art results in image captioning by combining a convolutional neural network with a recurrent neural network. In the work of J. Mao et al., a deep convolutional network for images is coupled with a deep recurrent neural network for sentences; this approach is one of the important methods for image captioning studies using a multimodal space. Moreover, when Mao was an intern at Google, he proposed a model whose description generation module can produce an unambiguous caption for a specific object and which can also comprehend an expression to infer the object being described [41].

4) OTHER RESEARCH COMMUNITIES
Group #2 (in Fig. 4) has 18 researchers, making it the second largest community in the analysis. As the biggest node in this community, M. Rohrbach published eight papers on image-to-text research, many of them related to video description. According to Google Scholar, he is a research scientist at Facebook AI Research and one of the highly cited authors in the image captioning field. T. Darrell is a highly cited author in computer vision, has made outstanding contributions to semantic image segmentation, and is one of the co-authors of the deep learning framework Caffe. K. Saenko has conducted influential research on domain adaptation methods in machine learning [43]. While most current research in the image-to-text area focuses on captioning single images, Z. Lin et al. [46] tried to generate relational captions for two images, which is useful in various practical applications (e.g., image editing, difference interpretation, and retrieval); this paper opens up a new research direction in image captioning technology. Y.F. Wang, Z. Lin, S. Cohen et al. [47] argued that humans recognize a picture by first locating the objects and their relationships and then elaborating the attributes of each object. Accordingly, they designed a coarse-to-fine approach that first generates a skeleton sentence, then generates the corresponding attribute phrases, and finally combines the two into a complete caption. S. Cohen is from Adobe Research and has three papers on image caption generation in our dataset. In [48], Cohen et al. addressed the problem that generated captions are often too generic in visual language generation; to produce better captions, they added discriminability as an objective during training. Cohen et al. also investigated figure caption generation, where the goal is to automatically generate a natural language description for a given figure [49]. K. Fu, J.Q. Jin, C.S. Zhang et al. [50] from group #7 proposed an image captioning system that exploits the parallel structures between images and sentences. In [51], they proposed the Image-Text Surgery approach to synthesize pseudo image-sentence pairs, which alleviates the expensive manual labeling of data. In 2019, they presented a novel training objective for image captioning consisting of two parts representing explicit and implicit knowledge, respectively [52]. In cluster #8, X.Y. Dong, Y. Yang et al. [53] proposed Fast Parameter Adaptation for Image-Text Modeling (FPAIT), which can jointly understand image and text data from a few examples. Y. Wu, L.C. Zhu, Y. Yang et al. [54] introduced the zero-shot novel object captioning task and proposed the Decoupled Novel Object Captioner (DNOC) framework, which fully decouples the language sequence model from the object descriptions. Researchers in group #9 focused on cross-domain image captioning [16,55], using dual learning and multitask learning to improve captioning performance. W.Y. Lan, X.R. Li et al. from community #10 have mainly studied cross-lingual image captioning. In [56], they proposed a fluency-guided learning framework to learn a cross-lingual captioning model from machine-translated sentences. To enable image captioning applications that push the boundaries of language, X.R. Li et al. proposed COCO-CN, which enriches the MS-COCO dataset with manually written Chinese sentences and tags [57]. Considering that LSTM units are complex and inherently sequential across time, J. Aneja, A. Deshpande, and A.G. Schwing [1] from group #11 developed a convolutional image captioning technique. In [58], J. Aneja et al. proposed SeqCVAE, which learns a latent space for every word position. K. Shuster et al. [59] from cluster #12 proposed PERSONALITY-CAPTIONS, whose goal is to be as engaging to humans as possible by incorporating controllable style and personality traits. X. Jia et al. in group #13 presented an extension of the long short-term memory (LSTM) model, which they termed gLSTM [60].
Since the scientometric approach to mining collaboration networks depends on the dataset as well as on the algorithms of the visualization software, it cannot cover all research communities. Fig. 9 shows a further active research community in the image captioning field. T. Yao et al. [7] investigated the effect of image attribute features on description results [63], and they demonstrated a video captioning bot named Seeing Bot [64]. Besides, T. Yao et al. [65] presented a hierarchy parsing architecture that integrates hierarchical structure into the image encoder to boost captioning.

C. KEY TOPIC AND REFERENCES IDENTIFICATION
Co-authorship networks demonstrate the social links among scientists within a research community. However, some small research groups and independent authors, who may nonetheless have important research ideas, cannot be revealed by co-author analysis. Term co-occurrence analysis can reveal important research topics in image captioning without considering co-authorship: for example, researchers who have never co-authored a paper may still use the same or similar key terms. Word co-occurrence analysis therefore captures more of the information structure of textual topics.

1) TERM CO-OCCURRENCE ANALYSIS
The keywords field (containing the title and keywords of an article) is the core summary of a paper, and analyzing it provides a glimpse into the paper's topic. Since the several keywords given in a paper must be related in some way, this association can be expressed by their co-occurrence frequency: it is generally accepted that the more frequently a word pair appears in the same documents, the closer the relationship between the two topics. The VOSviewer analysis in Fig. 10 shows the main research architectures in the field of image captioning in the last decade, especially since the introduction of neural networks into computer vision. After Google proposed neural image captioning in 2015, the Encoder-Decoder architecture was widely adopted and a large number of papers based on it appeared. Later, after the attention mechanism was proposed, researchers combined the two and published a variety of image description generation articles based on this combination. Hence "neural network" and "attention mechanism" are the largest labels in the graph.
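The computation underlying such a map is simple pair counting. A minimal Python sketch, assuming one normalized keyword list per paper, is:

from collections import Counter
from itertools import combinations

def coword_counts(keyword_lists):
    pairs = Counter()
    for kws in keyword_lists:
        # each unordered keyword pair co-occurring in a paper counts once
        for a, b in combinations(sorted(set(kws)), 2):
            pairs[(a, b)] += 1
    return pairs    # e.g. pairs.most_common(10) gives the strongest links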
The graph also reveals the dominant evaluation metrics and datasets in the field. Currently, the criteria for evaluating the quality of automatic image captions fall into two categories: human evaluation and machine evaluation. Machine evaluation is fast and inexpensive but far less accurate than human evaluation. From the leaderboard of the MS COCO competition, the commonly used metrics are BLEU [66], METEOR [67,68], ROUGE [69], CIDEr [70], and SPICE [36]. The first two were designed for machine translation, the third for automatic summarization, and the last two specifically for image captioning. The drawback of these metrics is that they focus mainly on the shallow features of sentences, matching words at the surface level while ignoring deeper sentence semantics. Hence, more accurate and efficient evaluation criteria still need to be developed. Furthermore, as the graph shows, Flickr8k, Flickr30k, and MSCOCO are the popular datasets in the image captioning community.
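To illustrate the shallow n-gram matching these metrics rely on, BLEU can be computed with NLTK as follows; the candidate and reference captions are invented purely for illustration.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "man", "is", "playing", "frisbee", "with", "a", "dog"],
              ["a", "man", "throws", "a", "frisbee", "to", "his", "dog"]]
candidate = ["a", "man", "plays", "frisbee", "with", "a", "dog"]

# Default weights give BLEU-4; smoothing avoids zero scores on short captions.
# The score reflects only n-gram overlap, not meaning, which is exactly the
# shallowness criticized above.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")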
As new directions and research areas, generative adversarial networks (GANs), reinforcement learning (RL), and Transformer methods are being used for natural language generation in the image caption field. Some current automatic image description methods adopt reinforcement learning and adversarial learning to annotate images, which can lead to semantically richer text descriptions; these two methods also help enable unsupervised image captioning. Transformer models have achieved state-of-the-art results in natural language processing tasks. For visual description tasks, Transformer models, which reduce structural complexity and offer better scalability and training efficiency, are likewise emerging as a new research direction.
At the same time, remote sensing image captioning is a new research direction that has developed in recent years; many published remote sensing image description papers use CNN-RNN (or LSTM) models based on the attention mechanism. We can also note that visual question answering (VQA), image retrieval, object detection, and image captioning are closely related subfields of computer vision whose developments mutually reinforce one another.

2) CO-CITATION ANALYSIS
Two (or more) papers cited together by one or more subsequent publications are said to be in a co-citation relationship. Co-citation relationships in the literature change over time, and studying the co-citation network allows the development and evolution of a discipline to be explored. For automatic captioning research, we used the CiteSpace clustering function to perform a cluster analysis of literature co-citations and to mine common themes among similar literature; the visualization results are shown in Fig. 11.
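Formally, two references are co-cited whenever they appear together in the reference list of a later paper. A minimal Python sketch of building the co-citation network, assuming one list of cited-reference identifiers per citing paper, is:

from collections import Counter
from itertools import combinations
import networkx as nx

def cocitation_network(reference_lists, min_count=2):
    counts = Counter()
    for refs in reference_lists:
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1           # the pair is co-cited once more
    g = nx.Graph()
    g.add_weighted_edges_from((a, b, c) for (a, b), c in counts.items()
                              if c >= min_count)  # prune rare co-citations
    return g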
First, the graph shows which articles are the most highly co-cited: the larger a node, the higher its co-citation frequency, and the larger the purple outer ring, the higher the node's betweenness centrality. We can see that O. Vinyals (2015) [9], R. Vedantam (2015) [70], and Q.Z. You (2016) [71] are very important papers in the field of image captioning, so cross-disciplinary experts and beginners can obtain the main results of image captioning from these frequently co-cited publications. Second, the graph also shows which publications are most closely associated, meaning that they often appear together in multiple later publications; co-cited documents are generally similar in content. We use CiteSpace's cluster analysis to uncover the common themes of related documents. From cluster #0 and cluster #11, we can conclude that domain adaptation has emerged as a new learning technique to tackle the dependence on large amounts of labeled data. Clusters #2, #3, #5, and #7, whose articles were published in the early stage of image captioning technology, demonstrate that "semantic feature matching", "visualization", and "unified hierarchical models" are important in the captioning field. Similarly, clusters #1 and #10 show that "reinforcement learning" and "remote sensing image captioning" are the themes extracted from the corresponding cited literature. Take group #10 as an example: Qu et al. [14], Shi et al. [11], Lu et al. [15], and Zhang et al. [72] focused on remote sensing image captioning research. These papers are cited together in the citing literature (although the other three articles in the cluster are not related to remote sensing), so CiteSpace adopted "remote sensing image captioning" as the theme of this group. Researchers interested in remote sensing image captioning techniques could start by understanding the main innovations of the papers mentioned above.

D. EVOLUTIONARY PATH OF IMAGE CAPTIONING
Historical reconstructions of scientific evolution can be depicted chronologically as the development of a network of citation relations extracted from the scientific literature. We illustrate the evolution of image captioning in recent years by combining the timeline view (Fig. 12) generated by CiteSpace and the timeline-based map (Fig. 13) generated by HistCite.

Fig. 12 shows the development of keywords in 684 image captioning research papers spanning the years 2014 to 2020. From top to bottom, the rightmost column presents the top 12 categories obtained by clustering all keywords (including IEEE keywords and author keywords), ordered from largest to smallest, where "task analysis" and "training" are IEEE keywords. From left to right runs the timeline of the development of image captioning techniques. The clustering results mainly include object detection, natural language processing, multimodality, and visual question answering, as well as deep learning methods such as "neural networks" and "CNN". It can be seen that the Transformer has become one of the frontier research hotspots. The visualization of high-frequency keywords shows that "attention mechanism", "attention model", "attention network", and other attention-related techniques became popular research directions after 2015. Meanwhile, the CNN-LSTM model became one of the mainstream models in research. In addition, image captioning techniques are gradually being applied to "remote sensing", "social media", and "medical image reporting". For example, since automatic captioning tools have the potential to empower blind and visually impaired people (BVIP) to learn more about social media images without relying on human-authored alt text or asking a sighted person, H. Macleod et al. [90] provided the first evaluation of full-sentence algorithmically generated image captions for blind and visually impaired people.

Fig. 13 presents the citation network visualized by HistCite, drawn directly from the relationships between citing papers and their references; it builds on the highly cited literature in the field and is temporal in nature. From the most cited literature of each year, we can see that [9] presented the Neural Image Caption model for the first time in 2015, bringing the Encoder-Decoder architecture to researchers and opening a new era of image caption research. In 2016, [71] introduced semantic attention mechanisms into image description, combining visual information from top-down and bottom-up approaches in a complex neural network framework; this attention-based model is the most widely circulated of the many related methods. Then, [83] proposed a spatial and channel-wise attention-based model in 2017, followed by further attention-based models [91].

A summary of image captioning models with their datasets and evaluation metrics is given in Table 5. VGGNet and ResNet are the most commonly used visual models, and LSTM is the most popular language model. As the table shows, creating a richer and more widely applicable image caption dataset is a challenging task. Visual Genome [106] is a large-scale image semantic understanding dataset released in 2016, and it has become almost the standard dataset in visual relationship detection research. RSICD [15] is a large-scale remote sensing benchmark dataset, which advances the task of remote sensing image captioning. Another challenging issue is the evaluation metrics.
SPICE [36] is a novel semantic evaluation metric that measures how effectively image captions recover objects, attributes, and the relations between them. [79] proposed a new metric, SPIDEr, which is a linear combination of SPICE and CIDEr.
Therefore, combining the results of the previous analyses, the evolutionary path of image captioning can be seen as follows. (1) Prior to 2015, with the development of machine learning and the initial application of deep learning, image captioning was at a steady stage of development and was not yet a hot topic in vision research. In 2014, Microsoft COCO was released; this dataset provided the basis for researchers to conduct image captioning research.
(2) The period from 2015 to 2018 was one of rapid development for image captioning. First, as part of the CVPR 2015 Large-scale Scene Understanding workshop, the COCO Captioning Challenge was designed to spur the development of caption-generating algorithms, and it greatly boosted the research enthusiasm of scholars in the field. Judging from the five most cited articles of 2015, recurrent visual representation models, the neural image caption model, the gLSTM model, fast novel visual concept learning, and video captioning were the hot spots of image captioning research at this stage. Second, after 2017, ResNet was used more often to train vision models, and the subdirections of image captioning gradually diversified, e.g., compositional captioning, dense captioning, and attention-based captioning; studies of different attention mechanisms were the main works of this phase [10,13,34,50,71,75,80,83-85,91]. Third, image captioning technologies were successfully applied to remote sensing images, medical image reporting, and robotics, and studies of datasets [15,57] and evaluation methods [36] also did much to drive the development of image captioning.
(3) From 2019 onward, image captioning has been in a boom phase. New techniques such as the Transformer, reinforcement learning, and GANs have been widely applied to image description problems, and unsupervised image captioning methods [92-95] have become a new research hotspot. The form of captioning has become more diverse, no longer confined to the overall content of the image [58,81,96]. In addition, Vision-Language Pre-training (VLP) is an emerging direction for image captioning and image understanding: [97] proposed a unified VLP model, pre-trained on a large number of image-text pairs with the unsupervised learning objectives of two tasks, bidirectional and sequence-to-sequence masked vision-language prediction; [98] developed a new pre-trainable encoder-decoder structure that simultaneously supports both vision-language understanding and generation downstream tasks.

IV. DISCUSSION
Scientometric analysis studies are primarily designed to provide a broad understanding of image captioning, because their methods rely on the statistical analysis of bibliographic data.
Through the visualizations of the scientometric analysis, we have recognized the spatial and temporal distribution characteristics, clarified the main research communities, and identified the current research hotspots and the scientific evolution paths of the image captioning field. For image description research, there is still a large gap between computer-generated text and human-annotated text, and automatic natural language generation will remain a challenging research topic for a long time.
How can bibliometric methods help the field of image captioning develop a more scientific taxonomy? Drawing on the preceding analysis, we propose a taxonomy of automatic image captioning methods. We group the different image caption approaches into two main categories: traditional machine learning based image captioning and deep learning based image captioning. We then classify the existing deep neural network based methods along four dimensions: "type of learning", "form of captions", "model architecture", and "novel methods". This systematic summary provides a more concise overview of the development of image captioning technology. Regarding the type of learning, i.e., whether training data are needed and whether image-text pairs are required, current unsupervised and partially supervised image captioning techniques [92-95,99] have achieved positive results. For the form of captions, in addition to the mainstream whole-picture caption, there are also image paragraph captioning [100-104] and dense captioning [8,77,105,106]. From the model architecture point of view, image captioning methods comprise the Encoder-Decoder architecture and the compositional architecture [33,76,107]. Furthermore, novel image captioning methods are included, such as GAN-based methods [81,108], RL-based methods [109], and Transformer-based methods [110-114].
Every approach has limitations, and scientometric methods are no exception. First, scientometric analysis can only be applied to disciplines whose literature and citations are available in appropriate databases; as with many bibliometric studies, we chose only the WOS as a data source, and if a new research direction has too little published literature, it may not be discoverable through bibliometric methods. Second, although scientometric analysis is an empirical and objective method for analyzing knowledge structure, the interpretation of the graphs also matters: one needs to understand the underlying algorithms and parameters of the different literature analysis tools in order to read a good "story". For example, image captioning has only made great progress in the last few years, so the evolutionary path that can be shown at a macro level is relatively limited.

V. CONCLUSION
By analyzing the bibliographic data of image captioning research, this article identifies the spatial and temporal distribution characteristics of image captioning. The field has shown a year-on-year increase in publications over the last decade. China has published the largest number of papers in this field, but the United States has had a greater impact on research in this area. Moreover, Microsoft Research, Google Research, and other Silicon Valley giants, as well as universities such as the Australian National University and the Georgia Institute of Technology, have performed strongly in the field of image description.
In the meantime, we can answer the questions Q1-Q3 presented at the beginning of this article. (A1) Based on VOSviewer, we discovered thirteen research communities. As the provider of the MSCOCO dataset, Microsoft has produced numerous innovative results that form one of the key communities in the image caption field; Google Research presented the encoder-decoder architecture to researchers and is one of the most important communities in terms of impact. This paper will support scientific research in image captioning and help researchers understand the current state of the field, its active research communities, and its research trends, so that they can promote the further development and use of image captioning technology.