Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation

Recently, automatic image caption generation has become an important focus of work on multimodal translation. Existing approaches can be roughly categorized into two classes, i.e., top-down and bottom-up: the former transfers image information (visual-level features) directly into a caption, while the latter uses extracted words (semantic-level attributes) to generate a description. However, previous methods are typically based on a one-stage decoder, or utilize only part of the visual-level or semantic-level information for image caption generation. In this paper, we address this problem and propose an innovative multi-stage architecture (called Stack-VS) for rich fine-grained image caption generation, combining bottom-up and top-down attention models to effectively handle both the visual-level and semantic-level information of an input image. Specifically, we propose a novel, well-designed stacked decoder model constituted by a sequence of decoder cells, each of which contains two LSTM layers that work interactively to re-optimize the attention weights on both visual-level feature vectors and semantic-level attribute embeddings for generating a fine-grained image caption. Extensive experiments on the popular benchmark dataset MSCOCO show significant improvements on different evaluation metrics: the BLEU-4/CIDEr/SPICE scores reach 0.372, 1.226 and 0.216, respectively, as compared to the state of the art.


I. INTRODUCTION
IMAGE annotation, the process of assigning metadata in the form of captions or keywords to an image, plays a significant role in content-based image retrieval (CBIR) [1]. However, manual image annotation is extremely expensive and time-consuming, especially when the dataset is large and constantly growing in size. Recently, Automatic Image Caption Generation (AICG, a.k.a. automatic image tagging) has received a considerable amount of attention in the computer vision (CV) and natural language processing (NLP) communities. In fact, the AICG problem can be treated as a multi-class image classification problem, the goal of which is to automatically assign annotations to a given image by modeling the correlations between the annotation words and the "visual vocabulary", i.e., the extracted visual features. The challenge of the AICG task lies in effectively modeling both the visual-level and semantic-level information of the given image so as to generate a meaningful, human-like, rich image description. Inspired by machine translation, great attention has been paid to exploiting the encoder-decoder architecture for image caption generation [2], [3], [4], [5], [6], [7], which commonly consists of a Convolutional Neural Network (CNN) based image feature encoder and a Recurrent Neural Network (RNN) based sentence decoder. Several efforts have already been dedicated to this topic, which can be roughly categorized into two classes, i.e., top-down [4], [2], [3], [8], [9] and bottom-up [10], [11]. The former converts image information (visual features) directly into descriptions, while the latter converts generated words (semantic attributes) describing aspects of the given image into a word sequence. However, these methods only partially address the challenges at the visual or semantic level, and still suffer from the following issues.
(i) The lack of an end-to-end formulation that generates sentences based on individual aspects; and (ii) the generation of ambiguous information due to involving the whole image map. These drawbacks significantly reduce the accuracy and richness of the generated descriptions, and naturally motivate the attempt to combine the two models for caption generation on both visual-level and semantic-level information.
Recently, several works have employed a combination of top-down and bottom-up models to imitate human cognitive behavior by applying attention to salient image regions [12], [13]. However, most of these approaches are based on single-stage training and easily omit the interaction among detected objects during decoding, such as [12], and thus predict erroneous semantic attributes (e.g., positional relations) among objects, since they ignore the context-based fine-grained visual information; this inevitably makes it difficult to generate a rich fine-grained description.
To this end, this paper proposes a unified coarse-to-fine multi-stage architecture that combines the bottom-up and top-down approaches via a visual-semantic attention model, which is capable of effectively leveraging both the visual-level image features and semantic-level attributes for image caption generation. Fig. 1 presents an illustration of our proposed method (called Stack-VS). Specifically, there are two branches, i.e., a visual image detector and a semantic attribute detector, that simultaneously construct the original visual-semantic information for salient regions during encoding, while the decoder includes a stack of multiple linked decoder cells that repeatedly generate finer details; each decoder cell consists of two LSTM [14], [15] attention-based layers, namely a visual-semantic attention layer and a language model. In particular, we employ the visual-semantic attention layer to simultaneously assign attention weights to visual features and semantic attributes; the attended image features and semantic attributes are fed into a language attention model to generate the hidden states of the current stage at each time step, which are then transferred to the next decoder cell for re-optimization.
Fig. 1 (caption). For each predicted word of the generated caption, the input image is first passed through a visual image detector and a semantic attribute detector, yielding a series of image feature vectors and semantic attribute embeddings of salient regions. Subsequently, the proposed Stack-VS attention model gradually generates and refines the attention weights for these feature vectors and attribute embeddings at each time step; the thickness of the red line and the blue line represents the contributions of the visual-level and semantic-level information derived from the input image to the predicted word through the multiple stages. The final word is the output of the final stage.

The main contributions of this work are as follows.
• We propose a novel coarse-to-fine multi-stage architecture that combines bottom-up and top-down attention models for image caption generation, which is capable of effectively handling both the visual-level and semantic-level information of the given image to generate a fine-grained image caption.
• We propose a well-designed stacked model which links a sequence of decoder cells together, each containing two LSTM layers that work interactively to generate a fine-grained image caption by re-optimizing the visual-semantic attention weights during multi-stage decoding.
• We conduct extensive experiments on the popular benchmark dataset MSCOCO with two different optimizers, i.e., cross-entropy-based and reinforcement-learning-based, which demonstrate the effectiveness of our proposed model compared to the state of the art, achieving BLEU-4/CIDEr/SPICE scores of 0.372, 1.226 and 0.216, respectively.
Road Map. The remainder of the paper is organized as follows. In Section II, we review the related work. The proposed coarse-to-fine image caption generation approach is presented in Section III, followed by the experimental results in Section IV. Finally, Section V concludes the paper.

II. RELATED WORK
CNN+RNN based Model. With the success of sequence generation in machine translation (MT) [2], [4], the encoder-decoder model has become an important focus of work on automatic image caption generation. Mao et al. propose a two-layer factored neural network model [2], i.e., a CNN-based visual-level extraction network and an RNN-based word-level embedding network, for estimating the probability of generating the next word conditioned on the CNN-extracted image feature and the previous word. Analogously to [2], Vinyals et al. [4] propose a combined CNN and RNN framework, in which the visual features extracted by the CNN are fed as the input at the first step, to maximize a likelihood function of the target sentence. Instead of using the entire image features, [3] proposes an alignment model to learn a multimodal embedding for generating descriptions over image regions. However, most of these models are inadequate for learning the RNN model because they use a fixed-size form to represent an image (e.g., a 4096-dimensional CNN feature vector [16]), which might suffer from an imbalance problem when encoding and decoding the visual-semantic information.
Attention-based Model. Several efforts have been dedicated to the automatic image caption generation problem by means of a visual attention mechanism [9], [17], [18], which steers the caption model to focus on salient image regions when generating target words. Existing attention-based approaches can be roughly categorized into two classes, i.e., bottom-up and top-down. Bottom-up approaches usually start with visual concepts, objects, attributes, words and phrases, and then combine them into sentences using a language model [12]. However, they may neglect the interaction among objects, and thus tend to fall back on certain fixed patterns when generating the final captions.
Top-down approaches start with a "gist" of the image and convert it into words [9], [17], [18], which enables the model to concentrate more on a natural basis. Although top-down approaches have achieved great success, they cannot extract fine-grained image information from the given image due to the mass of noisy data involved.
Fusion Model. To alleviate the above shortcomings, many attempts have been devoted to improving the accuracy and richness of the generated descriptions by combining top-down and bottom-up architectures [12], [13], [16]. However, the vast majority of studies tend to transform the image information into object feature vectors or semantic word embeddings to model the representations of the input image, which can lead to a noticeable decline in performance due to the following facts. First, ambiguous image information might be introduced during encoding, and directly compressing such information into image feature vectors makes it difficult to decode semantic information from the visual one. Second, omitting the visual-level and semantic-level relations among detected objects, e.g., positional information, results in unnatural generated captions. Third, modeling the image caption problem solely through a one-stage training model is insufficient for generating a fine-grained caption. Therefore, these methods still easily fail at the image caption generation task. Gu et al. [19] propose a multi-stage framework to address this problem, in which the framework is constituted by multiple stages, and the predicted hidden states and re-calculated attention weights are repeatedly produced for generating the next word at each time step. However, it cannot simultaneously handle visual-semantic information during decoding, and it ignores the attention information from previous time steps. In this work, our proposed approach is based on a coarse-to-fine architecture with a stack of decoder cells that repeatedly generate a fine-grained caption over multiple stages at each time step.
Specifically, each decoder cell consists of two LSTM layers that are capable of handling both the visual-level and semantic-level information of the given image, better capturing the interactions of salient regions and assigning different attention weights to the visual-level and semantic-level information, respectively, which benefits the generation of more accurate and rich descriptions.

III. COARSE-TO-FINE LEARNING FOR CAPTION GENERATION
In this section, we first present the formulation of our image caption generation problem in Section III-A, followed by the illustration of the image encoder in Section III-B and the coarse-to-fine decoder in Section III-C. Then, the key structure of the decoder, i.e., the decoder cell, is detailed in Section III-D, after which we discuss the learning process of our method.

A. Problem Formulation & Overview
Problem Formulation. Given a pair-wise training instance (I, S), I and S = {s_i}_{i=1}^{N_g} denote the input image and its corresponding ground-truth description sentence, respectively, where s_i is the i-th word of sentence S and N_g is the sentence length. The caption generation problem can then be formulated by minimizing the negative log probability of the ground-truth sentences over the entire set of training pairs via an optimizer^1 during training:

L(θ) = − Σ_{(I,S)} Σ_{k=1}^{N_g} log Pr(s_k | I, s_{1:k−1}; θ),    (1)

where Pr(s_k | I, s_{1:k−1}) indicates the probability of generating the k-th word s_k conditioned on the given image I and the previous word sequence s_{1:k−1}; θ is the parameter set of the model, which will be omitted for notational convenience; and the outer sum ranges over the training image set and its corresponding reference captions.
Therefore, the problem is transformed into estimating the probability Pr(s_k | s_{1:k−1}, I). Many approaches have been proposed for this task [9], [13], [17], [12], [11], [7]. They are typically based on a deep multi-modal RNN architecture in the encoder-decoder manner, the primary idea of which is to encode the visual feature vectors (V) [9], [13], [17] and/or semantic attribute embeddings (E) [12], [11], [7] into a fixed-size context hidden state vector, and then decode it to generate a possibly variably-sized description S. Without loss of generality, Eq. (1) can be rewritten as

L(θ) = − Σ_{(I,S)} Σ_{k=1}^{N_g} log Pr(s_k | s_{1:k−1}, V, E).    (2)

Overview. As aforementioned, the conventional approaches are commonly based on a simple one-pass attention mechanism [13], [21], [4], [22], and thus can hardly generate rich and more human-like captions. Following the previous work [19], we employ a coarse-to-fine architecture that exploits a sequence of intermediate-level sentence decoders to repeatedly refine the image descriptions, i.e., each stage predicts an incrementally refined description from the preceding stage and passes it to the next one. Let I^{(i−1)} be the output of the preceding stage, where i ∈ [1, N_s] and N_s denotes the total number of stages. Consequently, Eq. (2) is converted into

L(θ) = − Σ_{(I,S)} Σ_{i=1}^{N_s} Σ_{k=1}^{N_g} log Pr(s_k | s_{1:k−1}, V_0, E_0, I^{(i−1)}),    (3)

i.e., the sum of the negative log probabilities over the ground-truth sentences, where the first stage (i = 1) is a coarse decoder that generates a coarse description [19], and the subsequent stages (i > 1) are attention-based fine decoders that increasingly produce refined attentions for prediction.
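As a concrete toy illustration of the per-caption term in Eq. (1), the following sketch computes the negative log-likelihood of a ground-truth caption from the per-step probabilities the model assigns to its words; the probability values are hypothetical.

```python
import numpy as np

def caption_nll(step_probs):
    """Negative log-likelihood of one ground-truth caption (Eq. (1)).

    step_probs[k] stands for Pr(s_k | I, s_{1:k-1}) -- the probability
    the model assigns to the k-th ground-truth word.
    """
    step_probs = np.asarray(step_probs, dtype=np.float64)
    return float(-np.sum(np.log(step_probs)))

# Toy example: a 3-word caption predicted with these (hypothetical) probabilities.
loss = caption_nll([0.5, 0.25, 0.8])
```

Minimizing this quantity over all training pairs is exactly the maximum-likelihood objective of Eq. (1).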
Remark. By minimizing the loss function L in Eq. (3), the model is able to capture the contextual relationships among words in the target sentence. The entire coarse-to-fine framework is depicted in Fig. 2; it mainly consists of two parts, namely image encoding and coarse-to-fine decoding, which are elaborated in detail below.
B. Image Encoding

Let V_0 = {v_i}_{i=1}^{N_v} and E_0 = {e_i}_{i=1}^{N_e} be the visual feature vectors and the semantic attribute embeddings of image I, respectively, which are obtained as follows.
• Visual-level Image Feature. To model the visual-level features of the input image I, we employ a widely-adopted approach, i.e., the Faster R-CNN model [25], to encode the image I into spatial feature vectors (V) [9], [13], [17]. Here, each spatial image feature v_i ∈ R^{d_v} is a fixed-size d_v-dimensional feature vector (d_v is empirically set to 2048), denoted as the mean-pooled convolutional feature of the i-th salient region. It is encoded by the CNN according to the detected bounding boxes with a bottom-up attention mechanism and recursively activated at each time step. Additionally, each element v_i of the initial visual feature vectors (denoted as V_0) is generated from the output of the final convolutional layer of Faster R-CNN.
• Semantic-level Attribute Embedding. An image I typically contains various semantic attributes, which paves the way to predicting a probability distribution over a large attribute vocabulary so as to obtain the initial list of semantic attributes (denoted as E_0) that are most likely to appear in the image. Following the work [7], we use weakly-supervised multiple instance learning (MIL [11], [12]) to independently learn the most likely semantic-level attributes from the given image. In particular, the parts of speech (POS) of the attributes in our proposed method are unconstrained, e.g., nouns, verbs, adjectives, etc., which poses challenges in generating more natural captions, and thus we further explore the correlations between the visual-level features and the semantic-level attributes of the given image (cf. Section III-C). For ease of implementation, each attribute is mapped into a fixed-size d_e-dimensional semantic attribute embedding (e_i) before decoding.
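The attribute detector's per-word probabilities can be reduced to the initial attribute list E_0 by simple top-N_e selection. A minimal sketch, with a hypothetical four-word attribute vocabulary:

```python
import numpy as np

def top_attributes(attr_probs, vocab, n_e=20):
    """Keep the N_e attributes most likely to appear in the image.

    attr_probs: per-attribute probabilities from a (hypothetical) MIL
    detector; vocab: the attribute vocabulary. Returns E_0's word list.
    """
    order = np.argsort(attr_probs)[::-1][:n_e]
    return [vocab[i] for i in order]

probs = np.array([0.1, 0.9, 0.4, 0.7])
vocab = ["sky", "dog", "red", "running"]
e0 = top_attributes(probs, vocab, n_e=2)  # → ["dog", "running"]
```

In the full model each selected word would then be looked up in a learned embedding matrix to obtain its d_e-dimensional vector.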

C. Coarse-to-Fine Decoding
In this section, we present our proposed coarse-to-fine decoding architecture in detail, which is composed of a sequence of attention-based decoders that repeatedly produce refined image descriptions. For clarity, Fig. 3 presents the overall architecture of the coarse-to-fine sentence decoder.
More concretely, each predicted word of the generated description is decoded over multiple stages at each time step t ∈ [0, T−1], where T is the length of the generated caption. Without loss of generality, let N_s be the number of stages in our model, where each stage is constituted by T decoder cells and each decoder cell is a predictor producing the refined hidden state h_t^i over a single time step t; we elaborate on the definition of each decoder cell later (cf. Section III-D). Particularly, following the work [19], we use the operation of an RNN unit (i.e., an LSTM [14], [15]) to compute the hidden state of the i-th stage at time step t:

h_t^i = LSTM(ŝ_{t−1}, V_0, E_0, I^{(i−1)}; h_{t−1}^i),

where ŝ_{t−1} denotes the predicted word embedding at time step (t−1); V_0 and E_0 indicate the initial N_v visual feature vectors and N_e semantic attribute embeddings; and I^{(i−1)} is the output of the preceding stage, which is defined as

I^{(i−1)} = [h_t^{(i−1),L}, v̂_t^{(i−1)}, ê_t^{(i−1)}],

where v̂_t^{(i−1)} and ê_t^{(i−1)} are the attended visual feature vector and the attended semantic attribute embedding of the (i−1)-th stage at time step t. The details will be elaborated in Section III-D.
Remark. In particular, the first stage of our model is called the coarse stage, which generates a roughly attention-weighted caption based on both the initial visual-level feature vectors and the semantic-level attribute embeddings. Accordingly, we obtain a coarse prediction for the next stage, whose corresponding input is given by (h_{t−1}^{(N_s),L}, v̂_t^0, ê_t^0), where h_{t−1}^{(N_s),L} is the hidden state of the final stage at time step (t−1), and v̂_t^0 is a vector with all elements equal to 0 (similarly for ê_t^0).
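The stage recurrence above can be sketched as follows. The toy `decoder_cell` (a single tanh layer with random weights) merely stands in for the two-LSTM decoder cell of Section III-D, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h = 8          # hidden size (illustrative)
n_stages = 3     # N_s

word_emb = rng.standard_normal(d_h)           # stand-in for the word embedding ŝ_{t-1}
W = rng.standard_normal((d_h, 3 * d_h)) * 0.1

def decoder_cell(word, prev_stage_out, h_prev):
    """Toy stand-in for one decoder cell: refines the preceding
    stage's output I^{(i-1)} into this stage's hidden state."""
    x = np.concatenate([word, prev_stage_out, h_prev])
    return np.tanh(W @ x)

# One time step through the stack: stage i refines stage i-1's output
# and passes its own output on as I^{(i)}.
stage_out = np.zeros(d_h)                     # the coarse stage starts from zeros
hidden = [np.zeros(d_h) for _ in range(n_stages)]
for i in range(n_stages):
    hidden[i] = decoder_cell(word_emb, stage_out, hidden[i])
    stage_out = hidden[i]
```

The final `stage_out` plays the role of the finest-stage state from which the word distribution is predicted.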

D. Decoder Cell
As mentioned in [4], [19], modeling the entire information of an image for the generation of each single target word would lead to sub-optimal solutions due to the irrelevant cues introduced from other salient regions. Therefore, we need to calculate attention on a more natural basis to generate a more human-like caption.
Inspired by [4] but entirely different from their work, each decoder cell at time step t is designed as shown in Fig. 3, in which the output of each decoder cell is generated through two LSTM layers, i.e., a visual-semantic LSTM layer and a Language LSTM layer. Next, we elaborate on them in detail.
Visual-Semantic LSTM Layer. As shown in Fig. 3, there are two different branches in the visual-semantic LSTM layer: the red line denotes the flow of the visual attention model (denoted as LSTM_V), and the blue one indicates the flow of the semantic attention model (denoted as LSTM_S); the two work interactively to calculate the attention weights on the visual feature vectors and semantic attribute embeddings.
Visual-Attention Model. To be specific, the input of LSTM_V models the maximal visual-level context based on the output of the Language LSTM of the preceding stage, as well as the word embedding predicted so far. Hence, the input x_t^{i,V} to the visual attention model at the i-th stage and time step t consists of the previous hidden state h_t^{(i−1),L} of the Language LSTM concatenated with the predicted word embedding at time step (t−1):

x_t^{i,V} = [h_t^{(i−1),L}, ŝ_{t−1}].

Then, the output of LSTM_V is computed as

h_t^{i,V} = LSTM_V(x_t^{i,V}, h_{t−1}^{i,V}).

To tackle the sub-optimal problem, we generate a visual attention weight for the k-th visual feature vector v_k ∈ V_0 at the i-th stage and time step t:

a_{t,k}^{i,V} = w_{v,a}^T tanh(W_{v,a} v_k + W_{h,a} h_t^{i,V}),   α_t^{i,V} = softmax(a_t^{i,V}),

where W_{v,a} ∈ R^{d_a×d_v} and W_{h,a} ∈ R^{d_a×d_h} are trainable parameter matrices; d_a is the number of hidden units in the visual (or semantic) attention model; and d_h is the dimension of the output of LSTM_V (or LSTM_S). Then, the attended visual feature vector can be calculated as

v̂_t^i = Σ_{k=1}^{N_v} α_{t,k}^{i,V} v_k.

Semantic Attention Model. Analogous to the visual attention model, the output of LSTM_S is calculated as

h_t^{i,S} = LSTM_S(x_t^{i,S}, h_{t−1}^{i,S}).

For the k-th semantic attribute embedding e_k ∈ E_0 at the i-th stage and time step t, the semantic attention weight is given by

a_{t,k}^{i,S} = w_{e,a}^T tanh(W_{e,a} e_k + W_{s,a} h_t^{i,S}),   α_t^{i,S} = softmax(a_t^{i,S}),

and consequently the attended semantic attribute embedding is

ê_t^i = Σ_{k=1}^{N_e} α_{t,k}^{i,S} e_k.

Language LSTM. Different from the work [13], we model the language model based on both the visual-level and semantic-level information of the given image I. Specifically, the input of LSTM_L is constituted by the visual-semantic information of the given image, i.e., the sum of the outputs (h_t^{i,V}, h_t^{i,S}) of LSTM_V and LSTM_S, as well as the outputs (v̂_t^i, ê_t^i) of the visual and semantic attention models. However, there exists a cross-modal problem: the dimension of v̂_t^i differs from that of ê_t^i, since we utilize different approaches to initialize the visual feature vectors (V_0) and the semantic attribute embeddings (E_0).
As such, we use two fully connected layers, FC_V and FC_S, to convert them into a unified form (whose dimension is the same as that of h_t^{i,V} and h_t^{i,S}), and thus the input x_t^{i,L} and the output h_t^{i,L} of LSTM_L are given by

x_t^{i,L} = h_t^{i,V} + h_t^{i,S} + FC_V(v̂_t^i) + FC_S(ê_t^i),
h_t^{i,L} = LSTM_L(x_t^{i,L}, h_{t−1}^{i,L}),    (16)

where h_{t−1}^{i,L} denotes the hidden state of the Language LSTM at the previous time step.
Subsequently, the probability of generating the k-th word s_k given the previous word sequence s_{1:k−1}, the initial visual feature vectors V_0 and semantic attribute embeddings E_0, together with the output of the preceding stage I^{(i−1)}, i.e., Pr(s_k | s_{1:k−1}, V_0, E_0, I^{(i−1)}) in Eq. (3), can be rewritten as

Pr(s_k | s_{1:k−1}, V_0, E_0, I^{(i−1)}) = softmax(W_{h_l,p} h_k^{i,L}),

where h_k^{i,L} denotes the hidden state of the Language LSTM given by Eq. (16); W_{h_l,p} ∈ R^{d_p×d_h} is a trainable parameter matrix; and d_p is the vocabulary size.
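The per-step decoder-cell computation (additive visual/semantic attention, FC fusion, and the vocabulary softmax) can be sketched with random stand-in weights. The LSTM transitions are elided: `h_v` and `h_s` simply stand in for the LSTM_V and LSTM_S outputs, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_v, d_e, d_h, d_a, d_p = 6, 4, 5, 3, 10   # toy sizes (assumed)
n_v, n_e = 7, 3

V0 = rng.standard_normal((n_v, d_v))  # visual feature vectors
E0 = rng.standard_normal((n_e, d_e))  # semantic attribute embeddings
h_v = rng.standard_normal(d_h)        # stand-in for the LSTM_V output
h_s = rng.standard_normal(d_h)        # stand-in for the LSTM_S output

# Additive attention over the visual features (weights are hypothetical).
w_a  = rng.standard_normal(d_a)
W_va = rng.standard_normal((d_a, d_v))
W_ha = rng.standard_normal((d_a, d_h))
scores = np.tanh(V0 @ W_va.T + h_v @ W_ha.T) @ w_a   # shape (n_v,)
alpha  = softmax(scores)
v_hat  = alpha @ V0                                   # attended visual vector

# The same form over the attribute embeddings.
w_b  = rng.standard_normal(d_a)
W_eb = rng.standard_normal((d_a, d_e))
W_hb = rng.standard_normal((d_a, d_h))
beta  = softmax(np.tanh(E0 @ W_eb.T + h_s @ W_hb.T) @ w_b)
e_hat = beta @ E0                                     # attended attribute vector

# FC_V / FC_S project both attended vectors to d_h; the fused x^{i,L}
# would feed LSTM_L, whose state a final linear layer + softmax maps
# to a distribution over the d_p-word vocabulary.
FC_V = rng.standard_normal((d_h, d_v))
FC_S = rng.standard_normal((d_h, d_e))
x_L  = h_v + h_s + FC_V @ v_hat + FC_S @ e_hat        # shape (d_h,)
W_p  = rng.standard_normal((d_p, d_h))
word_probs = softmax(W_p @ x_L)                        # shape (d_p,)
```

Note that the attention weights `alpha` and `beta` each sum to one, so `v_hat` and `e_hat` are convex combinations of the region features and attribute embeddings, respectively.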

E. Learning
Indeed, the coarse-to-fine framework is a complex deep architecture whose training easily leads to the vanishing gradient problem, due to the noticeable decline in the magnitude of gradients when back-propagating through multiple stages [19]. To tackle this problem, we employ a cross-entropy (XE) loss that incorporates ground-truth information into the learning of the intermediate layers for supervision; the overall loss function is defined by cumulatively summing the cross-entropy loss at each stage:

L_XE(θ) = − Σ_{i=1}^{N_s} Σ_{k=1}^{N_g} log p_{θ_{0:i}}(s_k | s_{1:k−1}, V_0, E_0, I^{(i−1)}),
where p_{θ_{0:i}}(s_k | s_{1:k−1}, V_0, E_0, I^{(i−1)}) is the probability of the k-th word s_k given by the output of the LSTM_L decoder at the i-th stage, and θ_{0:i} denotes the parameters up to the i-th stage decoder. Note that the attention weights of our model need to be re-trained at each time step because the received information fluctuates on a per-stage basis; this is quite different from the shared-weight network in [19].
However, the exposure bias issue is not fundamentally solved by existing cross-entropy-based training approaches [19], [13], [21]. To this end, an alternative solution is to treat the generative model as a reinforcement learning problem [26], [27], [21], in which the Language LSTM decoder is viewed as an "agent" that interacts with an external "environment" (i.e., the visual-semantic information of the image). For simplicity, let S̃ = {s̃_t}_{t=1}^{N_T} be the sampled caption, where each word s̃_t is sampled from the output of the final stage (N_s) at time step t according to an action s̃_t ∼ p_θ, p_θ is a policy based on the parameterized network θ, and N_T is the length of the sampled caption. To minimize the negative expected reward, the objective is defined as

L_RL(θ) = − E_{S̃∼p_θ}[r(S̃)],    (19)

where r(·) is the reward function (e.g., CIDEr), computed by comparing the generated caption with the corresponding reference captions of the input image using the standard evaluation metric.
To reduce the variance of the gradient estimate, for each training sample the expected gradient can be approximated with a single Monte-Carlo sample S̃ ∼ p_θ, following Self-Critical Sequence Training (SCST [21]):

∇_θ L_RL(θ) ≈ −(r(S̃) − b) ∇_θ log p_θ(S̃),    (20)

where the baseline b is empirically set to CIDEr(Ŝ), and Ŝ is the caption obtained by greedy decoding with the current model.
Remark. Essentially, Eq. (20) tends to increase the probability of sampled captions that score higher than the caption greedily generated by the current model.
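A minimal sketch of the SCST update in Eq. (20) for a single decoding step: the advantage r(sampled) − r(greedy) scales the usual softmax cross-entropy gradient on that step's logits. All numbers are illustrative:

```python
import numpy as np

def scst_advantage(sampled_reward, greedy_reward):
    """SCST baseline (Eq. (20)): the policy gradient for a sampled
    caption is scaled by r(sampled) - r(greedy). Positive means the
    sampled caption beat the greedy baseline and is reinforced."""
    return sampled_reward - greedy_reward

def scst_grad_logits(logits, sampled_idx, advantage):
    """Gradient of -advantage * log p(sampled word) w.r.t. the logits
    of one decoding step (the softmax cross-entropy identity:
    d(-log p_i)/dz = p - onehot_i, scaled by the advantage)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    onehot = np.zeros_like(p)
    onehot[sampled_idx] = 1.0
    return advantage * (p - onehot)

# Hypothetical example: the sampled caption scored CIDEr 1.2, greedy 1.0,
# and word 0 was sampled at this step.
g = scst_grad_logits(np.array([2.0, 0.0, 0.0]), 0, scst_advantage(1.2, 1.0))
```

With a positive advantage, the gradient pushes the sampled word's logit up (its component is negative, and gradients are subtracted during descent) and the others down.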

IV. EXPERIMENTAL RESULTS
In this section, we first present the experimental datasets and the evaluation metrics, and then describe the baselines for comparison with the data-processing and the implementation details. Finally, we conduct extensive experiments to evaluate the performance of our proposed algorithm.

A. Datasets
We evaluate our proposed method on a popular benchmark dataset, MSCOCO [28], [29], which contains 164,062 images with 995,684 captions, where each image has at least five reference captions. The dataset is divided into three parts, i.e., training (82,783 images), validation (40,504 images) and testing (40,504 images), with the test annotations withheld on the MSCOCO server for online comparison. To evaluate the quality of the generated captions, we follow the "Karpathy split" evaluation strategy [30], i.e., 5,000 images are chosen for offline validation and another 5,000 images for offline testing. The evaluation results are reported using the publicly available MSCOCO Evaluation Toolkit^2, compared to the state-of-the-art methods listed on the leaderboard of the MSCOCO online evaluation server^3. We also evaluate our proposed model Stack-VS on this platform.

B. Evaluation Metrics
To evaluate the image caption generation performance of different approaches, we adopt the widely used evaluation metrics BLEU, ROUGE, METEOR and CIDEr. In addition, we adopt another widely used metric, SPICE, which is closer to human evaluation criteria. For all metrics, higher is better.
BLEU [31] is widely used in the machine translation domain and is defined as the geometric mean of n-gram precision scores multiplied by a brevity penalty for short sentences.
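A toy sketch of the BLEU formula (brevity penalty times the geometric mean of n-gram precisions); real BLEU additionally clips each n-gram count against the references, which is omitted here:

```python
import math

def bleu(ngram_precisions, cand_len, ref_len):
    """BLEU = BP * geometric mean of n-gram precisions.

    ngram_precisions: the (already computed) 1..4-gram precisions;
    the brevity penalty BP is exp(1 - r/c) when the candidate is
    shorter than the reference, else 1.
    """
    bp = 1.0 if cand_len > ref_len else math.exp(1.0 - ref_len / cand_len)
    log_mean = sum(math.log(p) for p in ngram_precisions) / len(ngram_precisions)
    return bp * math.exp(log_mean)

# Hypothetical precisions for a 10-word candidate vs. a 12-word reference.
score = bleu([0.8, 0.6, 0.4, 0.3], cand_len=10, ref_len=12)
```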
ROUGE [32] was initially proposed for summarization and compares overlapping n-grams, word sequences and word pairs. Here, we use its variant ROUGE-L, which measures the longest common subsequence between a pair of sentences.
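ROUGE-L reduces to a longest-common-subsequence computation plus an F-measure over LCS precision and recall; a minimal sketch (β = 1.2 is a common choice in captioning evaluation code):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """F-measure over LCS precision/recall, as in ROUGE-L."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return (1 + beta**2) * p * r / (r + beta**2 * p)

# Hypothetical candidate and reference captions.
score = rouge_l("a dog runs fast".split(), "the dog runs".split())
```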
METEOR [33] is a machine translation metric defined as the harmonic mean of the precision and recall of unigram matches between sentences, making use of synonym and paraphrase matching.
CIDEr [30] measures the consensus between a candidate image description and the reference sentences provided by human annotators. To calculate it, stemming is first applied and each sentence is represented by its sets of 1-4 grams; then the co-occurrences of n-grams in the reference sentences and the candidate sentence are counted.
SPICE [34] is estimated according to the agreement between the scene-graph tuples of the candidate sentence and those of all reference sentences. The scene graph is essentially a semantic representation obtained by parsing the input sentence into a set of semantic tokens, e.g., object classes, relation types and attributes.

C. Baseline Methods & Parameter Setting
To evaluate the effectiveness of our proposed method, we compare our model with the following state-of-the-art methods:
1) Hard-Attention [9]. This method incorporates spatial attention over the convolutional features of an image into an encoder-decoder framework, training "hard" stochastic attention with the REINFORCE algorithm. Note that this method can also be trained with a "soft" attention mechanism via standard back-propagation; however, we neglect it due to its poorer reported performance compared to the "hard" one.
2) Semantic-Attention [12]. This method selectively attends to semantic concept proposals and mixes them into the hidden states and outputs of a recurrent neural network for image caption generation. Essentially, the selection and fusion form a feedback loop connecting top-down and bottom-up computation.
3) MAT [16]. This method converts the input image into a sequence of detected objects that is fed as the source sequence of an RNN model; these sequential instances are then translated into the target sequence for generating image descriptions.
4) SCST:Att2all [21]. This method utilizes "soft" top-down attention, with a representation of the partially-completed sequence as context, to weight visual-level features for captioning; it relies on a reinforcement learning method with a self-critical sequence training strategy to train the LSTM with an expected sentence-level reward loss.
5) Stack-Cap [19]. This method repeatedly produces refined image descriptions based on visual-level information with a coarse-to-fine multi-stage framework, which incorporates supervision information to address the vanishing gradient problem during training.
6) Up-Down [13]. This method combines bottom-up and top-down attention mechanisms to calculate region-level attentions for caption generation.
To be specific, it is essentially based on an encoder-decoder architecture: it first makes use of a bottom-up attention mechanism to obtain salient image regions, and then the decoder employs two LSTM layers to compute attention at the level of objects and salient regions with a top-down attention mechanism for sentence generation.
Our method. Our proposed method, named Stack-VS, is based on a multi-stage architecture with a sequence of decoder cells that repeatedly produce rich fine-grained image captions. In particular, each decoder cell contains two LSTM layers that work interactively to re-optimize the attention weights on both the visual-level feature vectors and semantic-level attribute embeddings for generating a fine-grained image caption.
Parameter Settings. To evaluate the quality of the image caption generation results of different methods, we follow the evaluation strategy in [21]. That is, we build the vocabulary with a standard filtering strategy: words appearing fewer than 5 times are filtered out.
After this processing, we obtained 8,791 words in the final vocabulary. For a fair empirical comparison, we use the same setting as [13] to extract the visual-level features (i.e., V_0), where N_v = 36 and each feature vector in V_0 is a 2,048-dimensional mean-pooled convolutional feature vector of the corresponding salient region. Similarly, we adopt the multiple instance learning (MIL) [12], [11] model to obtain the semantic attribute embeddings (i.e., E_0), where N_e = 20 and each attribute embedding is a 2,048-dimensional vector produced by an embedding matrix that is randomly initialized and learned during training. Our proposed model is implemented with LSTM networks, and its parameters are empirically set as follows: (1) the numbers of hidden units in the visual attention model and the semantic attention model are both set to 512; (2) we use adaptive moment estimation (Adam [20]) for optimization during supervised cross-entropy training, with the learning rate set to 5e−4 and shrunk by a factor of 0.8 every 3 epochs; (3) the scheduled sampling probability is increased by 0.05 every 5 epochs, with an upper bound of 0.25; after 30 epochs, we continue optimization with the reinforcement-learning-based method using a learning rate of 5e−5; and (4) the batch size is set to 78 and the maximum number of epochs is 100, analogous to [19].
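The vocabulary filtering step described above can be sketched as follows; the toy captions are hypothetical:

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Keep words appearing at least min_count times in the training
    captions (the paper filters out words with frequency below 5)."""
    counts = Counter(w for cap in captions for w in cap.split())
    return sorted(w for w, c in counts.items() if c >= min_count)

caps = ["a dog"] * 5 + ["a cat"] * 2
vocab = build_vocab(caps, min_count=5)  # → ["a", "dog"]
```

Words below the frequency threshold ("cat" here) are typically mapped to a single unknown-word token at training time.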

D. Quantitative Analysis
In this section, we evaluate the effectiveness of our approach in generating image captions, in comparison with the baseline methods on the test portion of the "Karpathy" splits. For a fair comparison, all compared methods are trained only on the MSCOCO dataset, and thus models using external information are not considered as candidates. Table I and Table II show the BLEU (i.e., B-1, B-2, B-3, B-4), METEOR, RougeL, CIDEr and SPICE scores of each method.
From Table I and Table II, we observe the following. First, Stack-Cap [19] performs better than Hard-Attention, Semantic-Attention, MAT and SCST:Att2all on all metrics. For example, Stack-Cap outperforms SCST:Att2all by 5.56%, 2.62%, 2.15% and 5.61% in terms of BLEU-4, METEOR, RougeL and CIDEr, respectively. This is because Stack-Cap employs a sequence of decoder cells that repeatedly refine the image descriptions, while SCST:Att2all trains a simple one-pass attention architecture, which can hardly generate rich, human-like captions. These results demonstrate that a multi-stage structure is more effective than a one-pass one. Second, Up-Down [CIDEr-Optimize] outperforms Stack-Cap by 1.53% and 2.39% in terms of BLEU-1 and SPICE, respectively, as Up-Down computes region-level attention with a combination of bottom-up and top-down attention mechanisms, which demonstrates the effectiveness of the combined architecture. Third, our proposed model Stack-VS consistently outperforms all baseline methods, and the improvements are statistically significant on all metrics. For example, Stack-VS outperforms Stack-Cap by 3.05%, 1.82%, 1.41%, 1.83% and 3.35% in terms of BLEU-4, METEOR, RougeL, CIDEr and SPICE, respectively. The reason might be two-fold: (1) the predicted words depend, to a degree, on the visual-semantic information of the input image, and Stack-VS handles visual-level and semantic-level information simultaneously when generating fine-grained details such as prepositions; and (2) the generated captions essentially need to depend on more visual-level information when generating semantic words such as nouns or adjectives. However, Stack-Cap relies solely on visual-level information, which results in more errors when generating semantic auxiliary words such as prepositions.
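The relative gains quoted above follow the usual percent-improvement formula. As a quick sketch (the example scores are illustrative stand-ins chosen to reproduce the quoted 5.61% CIDEr gap, not values copied from our tables):

```python
def rel_improvement(new, old):
    # relative improvement of `new` over `old`, in percent
    return (new - old) / old * 100.0

# illustrative CIDEr-like scores; the resulting gap matches the 5.61% above
gap = rel_improvement(120.4, 114.0)
```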
On the other hand, our proposed model also outperforms Up-Down [CIDEr-Optimize] by 2.48%, 0.72%, 1.41%, 2.08% and 0.93% in terms of BLEU-4, METEOR, RougeL, CIDEr and SPICE, respectively, even though it has an LSTM attention mechanism similar to ours. The explanation is that Up-Down relies solely on visual-level information while neglecting the semantic-level one, and its single-stage training process further limits its capability to generate fine-grained captions. In contrast, our proposed model is based on a coarse-to-fine structure with a sequence of decoder cells that repeatedly refine the caption in a stage-by-stage manner; at each time step, each decoder cell simultaneously handles both visual-level and semantic-level information to better capture the interactions of salient regions, and then assigns different attention weights to the visual-level and semantic-level hidden states, improving the accuracy and richness of the generated descriptions.

E. Qualitative Analysis
In this section, we conduct an in-depth analysis of our proposed model to demonstrate its superiority in generating better descriptions within the proposed multi-stage visual-semantic attention based framework.
Different Impacts of Visual-level and Semantic-level Information. From Fig. 4, we observe that our proposed model is capable of adaptively leveraging the different contributions of visual-level and semantic-level information for captioning. More concretely, the proportion of attended weight on semantic-level information increases when generating prepositions (e.g., "in", "front" or "of"), whereas generating object-related nouns (e.g., "people", "rides", "trains", "bicycles") makes the curve drop to some extent; that is, the model adaptively makes use of both visual-level and semantic-level information for image caption generation. Taking the first image as an example, the model: (1) after refinement, assigns the words "grass" and "parked" the highest attention scores in the final stage (i.e., stage-3) at the previous time step, so the predicted words at the current time step are correctly output; and (2) filters out noise and adaptively learns linguistic patterns to give highly-relevant word pairs the highest scores; for instance, the attention scores of "green" and "grass" are far apart in stage-1, yet after refinement they are assigned the top-2 scores in stage-3, as "green" is the most appropriate adjective to describe the noun "grass".
Impact of Visual-level Attention. To demonstrate that visual-level attention helps generate richer image captions, we visualize the assignment of visual-level attention weights at each time step for an in-depth analysis. As shown in Fig. 6, our proposed Stack-VS model gradually filters out noise, and the distribution of visual-level attention at the previous time step concentrates on the regions that are highly relevant to the next predicted word; the learned assignments are consistent with human intuition. Taking the first image as an example, in the attention visualizations of objects such as "bird", "train" or "beach", the distribution of visual-level attention is sharply pinpointed to the corresponding salient regions, which indicates that learning visual-level attention is effective for generating high-quality image captions.
Impact of Stacked Refinement. Here, we focus on the impact of stacked refinement in our proposed visual-semantic attention based multi-stage architecture. In fact, the caption generated in the coarse stage is adequate for understanding the input image, but the accuracy and richness of the description are insufficient. Take the images listed in Fig. 7 as examples for an in-depth analysis. From the figure, we observe that the ultimately predicted words are obtained by superimposing visual-level and semantic-level information to refine the description at each stage, yielding a more fine-grained caption. Indeed, the model at the next stage re-assigns the visual-semantic attention weights, fine-tuning them to repeatedly output auxiliary words that infer the semantic details of the given image using information from the previous stages. Hence, the sum of the semantic-level weights at stage-3 is actually lower than in the previous stages, as the attention weights are no longer strictly assigned to a limited scope of candidate words. Without loss of generality, take the first image as an example. The first stage is good enough to detect the correct objects, bridged by several simple prepositions. The subsequent stages of stacked refinement capture the relations among objects to form the skeleton of the caption, and then refine several words to supplement details, producing more natural and human-like captions across multiple stages. For example, the final caption replaces "in a suit" with "wearing a suit" and adds details describing "a man", such as "wearing ... and tie" and "in a mirror".
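The stage-by-stage re-assignment of attention mass can be illustrated with a minimal self-contained NumPy sketch. This is not the paper's model (plain tanh updates and invented parameter names): each stage merely re-scores the regions conditioned on the previous stage's hidden state, which is the mechanism that lets later stages shift weight onto different regions.

```python
import numpy as np

rng = np.random.default_rng(1)
D, F, N = 8, 8, 36  # hidden size, feature size, number of salient regions

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def stage(h_prev, V, W):
    # re-score regions against the previous stage's hidden state, so each
    # stage can redistribute attention mass while refining the caption
    alpha = softmax(np.tanh(V @ W["f"] + h_prev @ W["h"]) @ W["w"])
    h = np.tanh(W["u"] @ np.concatenate([h_prev, alpha @ V]))
    return h, alpha

W = {"f": rng.normal(size=(F, D)) * 0.1, "h": rng.normal(size=(D, D)) * 0.1,
     "w": rng.normal(size=D) * 0.1, "u": rng.normal(size=(D, D + F)) * 0.1}
V = rng.normal(size=(N, F))        # stand-in region features
h, alphas = np.zeros(D), []
for _ in range(3):                 # three stages, coarse to fine
    h, alpha = stage(h, V, W)
    alphas.append(alpha)
```

Each `alpha` remains a proper distribution over the regions, while its shape changes from stage to stage as the hidden state evolves.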

V. CONCLUSION
In this paper, we have proposed a visual-semantic attention based multi-stage framework for the image caption generation problem. Specifically, the proposed model combines top-down and bottom-up architectures with a sequence of decoder cells, whose hidden states are linked one-by-one and re-optimized at each time step. In particular, unlike previous studies that feed only visual-level or semantic-level features into the decoder to generate descriptions, our proposed model feeds both kinds of information into each decoder cell, whose two LSTM-based layers work interactively to repeatedly refine the attention weights. Experiments conducted on the popular MSCOCO dataset demonstrate that our proposed model achieves performance comparable to the state-of-the-art ensemble methods on the online MSCOCO test server. For future work, we plan to investigate the following: (1) incorporating natural language inference to generate more reasonable captions; and (2) trying different architectures for captioning, such as graph convolutional networks.

Wei Wei received the PhD degree from Huazhong University of Science and Technology, Wuhan, China, in 2012. He is currently an associate professor with the School of Computer Science and Technology, Huazhong University of Science and Technology. He was a research fellow with Nanyang Technological University, Singapore, and Singapore Management University, Singapore. His current research interests include computer vision, natural language processing, information retrieval, data mining, and social computing.

Ling Cheng received the master's degree from Fudan University, Shanghai, China, in 2019. He is currently a PhD student at Singapore Management University, Singapore. His research interests include computer vision and deep learning.

Xian-Ling Mao received the PhD degree from Peking University, in 2013. He is currently an associate professor of computer science with the Beijing Institute of Technology. He works in the fields of machine learning and information retrieval. His current research interests include topic modeling, learning to hashing, and question answering.
Dr. Mao is a member of the IEEE Computer Society and a member of the Association for Computing Machinery (ACM).

Guangyou Zhou received the PhD degree from the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (IACAS), in 2013. He is currently a professor at the School of Computer, Central China Normal University. His research interests include natural language processing and information retrieval. He has published more than 30 papers in the related fields.

Feida Zhu is an associate professor at Singapore Management University (SMU), Singapore. He received his Ph.D. degree from the University of Illinois at Urbana-Champaign (UIUC) in 2009. His research interests include large-scale data mining, text mining, graph/network mining and social network analysis.