A Sparse Transformer-Based Approach for Image Captioning

Image Captioning is the task of providing a natural language description for an image. It has caught significant amounts of attention from both computer vision and natural language processing communities. Most image captioning models adopt deep encoder-decoder architectures to achieve state-of-the-art performances. However, it is difficult to model knowledge on relationships between input image region pairs in the encoder. Furthermore, the word in the decoder hardly knows the correlation to specific image regions. In this article, a novel deep encoder-decoder model is proposed for image captioning which is developed on sparse Transformer framework. The encoder adopts a multi-level representation of image features based on self-attention to exploit low-level and high-level features, naturally the correlations between image region pairs are adequately modeled as self-attention operation can be seen as a way of encoding pairwise relationships. The decoder improves the concentration of multi-head self-attention on the global context by explicitly selecting the most relevant segments at each row of the attention matrix. It can help the model focus on the more contributing image regions and generate more accurate words in the context. Experiments demonstrate that our model outperforms previous methods and achieves higher performance on MSCOCO and Flickr30k datasets. Our code is available at https://github.com/2014gaokao/ImageCaptioning.


I. INTRODUCTION
Nowadays we encounter a large number of images on the Internet, most of which lack a description, yet in the era of Artificial Intelligence machines need to interpret image content automatically. Image captioning has been an important research direction for many years. For example, it can help visually impaired users understand images on the Internet, and it has found applications in areas such as content-based image retrieval, social media platforms and human-machine interaction. Image captioning is one of the most challenging tasks in computer vision: it aims to automatically generate natural descriptions for images, which requires not only detecting the objects and their relationships in an image but also generating a syntactically and semantically correct sentence [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Chang-Hwan Son.
The encoder-decoder architecture, followed by [2], [3], is the mainstream approach to image captioning. In such a framework, an image is first encoded into a set of feature vectors via a CNN-based network [4] and then decoded into words via an RNN-based network [5]. In addition, the attention mechanism [6] plays an important role in this encoder-decoder framework: it generates encoded vectors from the image features and the hidden states of the LSTM at each time step, aiming to align words to specific image regions [7]. In recent years, the great success of the Transformer network [8] has attracted many researchers to explore cutting-edge self-attention techniques for image captioning. The outstanding performance of self-attention comes from its ability to explore the relationships between the detected entities in the encoder and, as an attention mechanism, to model the correlation between image regions and hidden states [9], [10].
However, the self-attention operation in the original Transformer framework has an obvious drawback: it assigns credit to all components of the context. This is unreasonable, because much credit may be assigned to irrelevant information that should be ignored. For instance, the mainstream self-attention mechanism estimates attention weights by multiplying the given query (hidden states) with the key (encoded image features) from a different modality; the attention matrix is then applied to the value (encoded image features) to derive a weighted sum. Nevertheless, encoded image features may have little correlation with some irrelevant words (hidden states), which leads to very small products between query and key, so an explicitly sparse Transformer framework is needed to tackle the problem. More recently, local attention constraints were introduced in [11], and Zhao et al. [12] proposed to sparsify attention with a top-k strategy; however, constrained local attention breaks long-term dependency, and the top-k strategy cannot filter irrelevant information efficiently when the attention scores are very close. To resolve these problems, we propose a novel model named Local Adaptive Threshold that explicitly sparsifies the attention matrix for image captioning and mitigates the drawbacks of previous methods by enhancing the attention selection ability. Technically, we group the attention matrix into multiple chunks, and an element is filtered out if its value is smaller than the average of the data in the corresponding chunk. Local Adaptive Threshold does not hurt long-term dependency and is determined, for each element, by the distribution of its neighboring nodes.
For further investigation, we integrate Local Adaptive Threshold into self-attention, where it acts as the attention mechanism in the decoder and triggers the interaction between visual features and the natural-language description. In a nutshell, Local Adaptive Threshold pays little attention to the least contributive states and can perform more concentrated attention than the original Transformer framework. On the encoder side, the application of self-attention encourages our model to explore the relationships between the detected entities. Furthermore, we extend the detected entities with extra vectors that encode a persistent memory; these vectors are shared across all attention heads rather than relying on a certain head, and are designed as trainable weights.
The contributions of this work are summarized as follows:
• Given that Transformer-based attention mechanisms may extract irrelevant information, we introduce a novel model named Local Adaptive Threshold that explicitly ignores the least contributive elements in the attention matrix, reducing the extraction of irrelevant items and enhancing the concentration of attention.
• We adopt multi-level representations of image features to help the encoder exploit low-level and high-level features, considering that different feature levels may contribute differently to the performance of the model. Extra vectors are concatenated with the feature vectors to encode a persistent memory.
• We conducted experiments on the MSCOCO [13] and Flickr30k [14] datasets. Our method achieved better performance than previous methods, with a 129.2 CIDEr score on the ''Karpathy'' offline test split and a 23.5 METEOR score on the Flickr30k test split.
The rest of the paper is organized as follows. In Section II, we review related work on modern image captioning methods, including earlier popular approaches, Transformer-based methods and LSTM-improved methods. In Section III, we introduce our framework in detail; in particular, we discuss the modifications we make to the encoder and decoder. Section IV reports experimental results that demonstrate the effectiveness and efficiency of our model. Section V concludes the paper.

II. RELATED WORK
A. IMAGE CAPTIONING
Popular approaches to image captioning adopt a deep encoder-decoder framework with an attention mechanism. Typically, Anderson et al. [7] extract a set of salient image regions, each represented by a pooled convolutional feature vector, using Faster R-CNN, which enables attention scores to be computed at the level of salient image regions. Lu et al. [15] first generate a sentence template with blank slots, which are then filled in with visual concepts by object detectors. Cornia et al. [16] propose a controllable approach that shifts the rank of salient image regions via a shift gate with adaptive attention. Li et al. [17] introduce a new architecture that facilitates vocabulary expansion and produces novel objects via a pointing mechanism and object learners. Derived from human intuition, attention mechanisms have achieved significant improvements on machine translation tasks. An attention mechanism normally calculates an importance score for each feature vector and normalizes the scores to weights using a softmax function; the weights are then applied to the feature vectors to generate the attention result [6]. Chen et al. [18] propose a spatial and channel-wise attention mechanism that applies attention in a channel-wise manner as a process of selecting semantic attributes. Lu et al. [19] study an adaptive attention mechanism that maps a feature vector to a visual word or a context word. Huang et al. [20] design an adaptive attention time mechanism that achieves arbitrary mappings between image regions and words via adaptive steps across all time steps. More recently, more complex information such as attributes and relationships has been integrated by GCNs [21] to generate fine-grained captions [22], [23], [24].

B. TRANSFORMER BASED METHODS
More researchers tend to utilize the Transformer framework to improve captions due to the excellent performance of the self-attention operation. Herdade et al. [25] design an architecture that incorporates information about the spatial relationships between detected objects through geometric self-attention. Li et al. [26] devise a unique attention mechanism to exploit visual and semantic information simultaneously in the Transformer framework. In addition, Huang et al. [9] introduce an extension of the attention operator in which the final attended information is concatenated with the hidden state of the LSTM and the attention context. To strengthen the connection between encoder and decoder, Cornia et al. [10] propose a meshed Transformer that learns a hierarchical representation of the relationships between image regions and uses mesh-like connectivity at the decoding stage to exploit hierarchical image features. Multiple instance learning is also employed to build an attribute extractor that explores and distills cross-modal information with multi-head attention [27]. Pan et al. [28] introduce an X-Linear attention block that measures both spatial and channel-wise attention distributions via bilinear pooling. All of these works take image features and hidden states as input to generate attention scores in the self-attention operation; however, not all image regions or hidden states are equally important to the model, and filtering out irrelevant elements makes attention more concentrated.

C. LSTM IMPROVED METHODS
A Long Short-Term Memory network serves as a sequence model that generates words from input image features, which is another important branch worth exploring. Ke et al. [29] introduce a reflective position module and a reflective attention module: the former selectively attends to the hidden states, while the latter determines the attention distribution over the input image regions. Qin et al. [30] introduce a novel LSTM model consisting of two parts: one looks back at the attention value of the previous time step and feeds it into the current time step, while the other predicts the next two words in one time step to improve captioning performance. Zheng et al. [31] start the description from a selected object and generate the rest of the sequence based on it: the left-side sequence is generated from the visual features and the input object, while the right-side sequence is generated from the left-side sequence. Deshpande et al. [32] propose to utilize part-of-speech tags to produce fast and accurate descriptions; their captioning model takes part-of-speech tags, image features and hidden states as input to generate a unified attention output, as common methods do. The number of works aiming to improve the LSTM is far smaller than for other methods, but this direction is still worth exploring.
Our work draws on these related works and introduces a novel model named Local Adaptive Threshold, which can sparsify the attention matrix efficiently and perform more concentrated attention than the original Transformer framework. In the next section, we introduce the modifications to the self-attention module and describe the architecture of our image captioning model in detail.

III. OUR METHOD

A. LOCAL ADAPTIVE THRESHOLD ON SELF-ATTENTION
In a Transformer framework, an attention function consists of a query and a set of key-value pairs, where the query, key and value are all vectors. As shown in Figure 1(a), it first measures a similarity score between the query and the key and scales the score down by √d_k, where d_k is normally the dimension of the query and key. It then applies a softmax function to obtain the weights on the values. The final output can be computed as:

Attention(q, k, v) = softmax(qk^T / √d_k) v

where q, k and v represent the query, key and value, respectively. Dot-product attention and multi-head attention have both proven efficient in practice in the Transformer, where multi-head attention is a concatenation of dot-product attention heads. In Figure 1(b), we integrate our sparse module between the scale function and the softmax function to concentrate attention by selecting the items with higher numerical values and ignoring the elements with smaller ones, since the softmax function is dominated by the largest elements. As shown in Figure 2, based on the hypothesis that elements of the matrix P with higher values represent closer relevance, we select the most contributive elements in each row of P to aggregate focus, where P is the matrix product of query and key.
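As a concrete reference, scaled dot-product attention can be sketched in a few lines of pure Python (a minimal single-head illustration with list-of-lists matrices, not the paper's batched implementation):

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention.

    q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v) -> output (n_q, d_v)
    """
    d_k = len(k[0])
    out = []
    for q_row in q:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        weights = softmax(scores)
        # weighted sum of the value rows
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out
```

Because softmax never outputs an exact zero, every value row receives some weight, however irrelevant — which is precisely the behavior the sparse module below is designed to suppress.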
Specifically, we split the attention matrix into n chunks and compute the mean value of each chunk. The next step is to set the elements smaller than t percent of the local average to negative infinity:

P'_ij = P_ij  if P_ij ≥ t · mean(P_i,j−w/2 , . . . , P_i,j+w/2)
P'_ij = −∞   otherwise

where w is the window size and t represents the threshold value. The rationale is to maintain a local window for each element in the attention matrix and to set irrelevant information to negative infinity according to the data distribution of the neighboring nodes. After applying the softmax function,

softmax(P')_ij = exp(P'_ij) / Σ_l exp(P'_il)

the negative-infinity elements become zero, which avoids the effect of negative noise. (Figure 2: the most contributive elements are assigned higher probabilities after Local Adaptive Threshold selection and the softmax function.) It should be noted that our proposed Local Adaptive Threshold draws on previous methods and is able to filter noisy information effectively.
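The select-then-softmax step can be sketched in pure Python as follows (a minimal illustration; the centered window and the exact boundary handling are our assumptions, since the text describes the rule only informally):

```python
import math

def local_adaptive_threshold(row, w=4, t=0.8):
    """Local Adaptive Threshold on one row of attention scores.

    Each score is compared against t times the mean of its local window of
    size w; scores below that adaptive threshold are set to -inf so that
    the subsequent softmax drives their weight to exactly zero.
    """
    n = len(row)
    masked = []
    for j, p in enumerate(row):
        lo = max(0, j - w // 2)
        hi = min(n, j + w // 2 + 1)
        window_mean = sum(row[lo:hi]) / (hi - lo)
        masked.append(p if p >= t * window_mean else float("-inf"))
    return masked

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    s = sum(exps)
    return [e / s for e in exps]
```

For example, `softmax(local_adaptive_threshold([1.0, 0.2, 0.9, 0.1]))` assigns zero weight to the two small scores and redistributes all probability mass over the two contributive ones.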

B. MULTI-LEVEL REPRESENTATIONS IN ENCODER
Given an image, we first extract a set of feature vectors V = {v_1, . . . , v_k}, where k is the number of feature vectors in V and v_i is a vector that represents a salient object region. Instead of directly feeding these vectors to the decoder, we build a multi-level image feature network consisting of low-level and high-level features to refine their representations, as illustrated in Figure 3. Discriminative region feature vectors are essential for image captioning. In our paper, each image has between 10 and 100 salient image regions, and each region is represented by a 2048-dimensional vector after feature extraction with Faster R-CNN [7]. Given feature vectors V ∈ R^{k×d}, we feed them into the attention module proposed in the previous section. The overall operation is:

V^{i+1} = Attention(V^i W_q, [V^i W_k; M_k], [V^i W_v; M_v])

where i ∈ {1, 2, . . . , 6}, W_q, W_k, W_v are matrices of learnable weights, M_k, M_v are persistent memory vectors concatenated to the keys and values, and V^{i+1} has the same cardinality as V^i. We can infer that the attentive weights depend solely on the pairwise similarities between linear projections of the input feature vectors; therefore, the self-attention operator can be seen as a way of encoding pairwise relationships between region pairs. We consider such a self-attention operation a single layer and stack several layers to obtain multi-level representations: vectors generated by the first few layers are low-level feature representations, while high-level feature representations are generated by the last few layers. Intuitively, different layers may contribute differently to the final image features. Different from the original self-attention operation, the key and value are concatenated with extra vectors that encode a persistent memory [33]. These memory vectors are designed to capture general knowledge rather than context-dependent information: they are shared across all attention heads instead of relying on a certain head, and are treated as trainable weights that can be regarded as the persistent memory.
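The key property of the layer above — memory slots are appended to keys and values only, so each layer still emits one refined vector per input region — can be sketched in pure Python (the projections are taken as identity maps here purely for illustration; in the model they are the learned W_q, W_k, W_v):

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def encoder_layer(features, mem_k, mem_v):
    """One self-attention encoder layer with persistent memory (sketch).

    Memory rows are concatenated to keys and values, never to queries,
    so the layer returns exactly one output vector per input region.
    """
    d = len(features[0])
    keys = features + mem_k   # [V W_k ; M_k]
    vals = features + mem_v   # [V W_v ; M_v]
    out = []
    for q in features:        # queries come from the regions alone
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, vals))
                    for j in range(d)])
    return out
```

Stacking six such layers yields the multi-level representations V^1, . . . , V^6 used by the decoder.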
These extra vectors do not depend on the input image feature vectors and are designed as learnable weights. Our experiments show the notable effect of these extra vectors: the CIDEr score is about 0.6 percent higher.

C. DECODER WITH SPARSE MULTI-HEAD SELF-ATTENTION
A captioning model normally uses two LSTM layers to produce words, where the first LSTM layer serves as an attention model and the second as a language model, as shown in Figure 4(a). Following [9], we apply multi-head self-attention in the decoder. The input vector to the first LSTM layer at each time step consists of the mean-pooled image feature, the previous output of the language LSTM, and the word embedding:

x_t^1 = [v̄; h_{t−1}^2; W_e Π_t]

where W_e is a word embedding matrix for the vocabulary and Π_t is the one-hot encoding of the input word at time step t. With the LSTM adopted, the hidden state h_t^1 and cell state c_t^1 are modeled as:

h_t^1, c_t^1 = LSTM(x_t^1, h_{t−1}^1, c_{t−1}^1)

where x_t^1 is the input of the first LSTM layer. The iteration is performed in a for loop with an LSTMCell rather than executing automatically with an LSTM, because we need to obtain the attention scores at each decoding step: an LSTMCell generates the output of a single time step per call, whereas an LSTM iterates over all time steps continuously and provides all outputs at once. Following the idea of self-attention, suppose that

q = h_t^1 W_q,  k^i = V^i W_k,  v^i = V^i W_v

where i ∈ {1, 2, . . . , 6} and V^i is the feature matrix from the i-th encoder layer. For multi-head attention, we divide q, k and v into 8 slices and generate the attention matrix P:

P = q k^T / √d

The final output representation of self-attention can be computed as:

head = softmax(A) v

where A is the result of the select operation on the matrix P, as shown in Figure 2. The concatenation operation is:

att_i = Concat(head_1, . . . , head_8) W_o

In order to make full use of each encoding layer, the final att operation is defined as:

att = Σ_{i=1}^{6} α_i ⊙ att_i

where α_i is a sigmoid gate representing the weight of att_i, and att_i is the sparse self-attention output of the i-th encoder layer. We fuse the layers together by a weighted sum rather than attending to a single visual input from the encoder; the intention is to exploit multi-level representations of image features, since each level may contribute differently to the context of the decoder.
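The manual per-step decode loop described above can be sketched with PyTorch's `nn.LSTMCell` (the dimensions here are arbitrary illustrative values, not the paper's; the point is that the hidden state is available for attention at every step):

```python
import torch
from torch import nn

feat_dim, hid, steps = 16, 8, 3
cell = nn.LSTMCell(feat_dim, hid)        # single-step LSTM
x = torch.randn(steps, 1, feat_dim)      # one caption, `steps` word inputs
h = torch.zeros(1, hid)
c = torch.zeros(1, hid)
per_step_hidden = []
for t in range(steps):
    h, c = cell(x[t], (h, c))            # one time-step update
    # h is available here, so sparse attention over image features
    # can be computed from the fresh hidden state at this exact step
    per_step_hidden.append(h)
out = torch.stack(per_step_hidden)       # (steps, 1, hid)
```

By contrast, `nn.LSTM` runs its fused loop internally and only returns the full output sequence afterwards, which would hide the intermediate states needed to build the attention query at each step.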
The input to the language model consists of the extended attention output concatenated with the output of the first LSTM layer:

x_t^2 = [att; h_t^1]

A gated linear unit is adopted afterwards instead of an LSTM, because a gated linear unit is more lightweight and can model context information as an LSTM does:

c_t = f(x_t^2) ⊙ g(x_t^2)

where f is a linear function and g represents a sigmoid function. The conditional probabilities of the words in the vocabulary at each time step can be computed as:

p(w_t | w_{1:t−1}) = softmax(W_p c_t)

The distribution over complete output sequences is calculated as the product of the conditional distributions:

p(w_{1:T}) = ∏_{t=1}^{T} p(w_t | w_{1:t−1})

IV. EXPERIMENTS

A. EVALUATION METRICS
BLEU [35], METEOR [36], CIDEr [37] and ROUGE-L [38] metrics are used to evaluate our method and compare it with other methods. BLEU measures the quality of machine-generated text by comparing individual parts of the text with reference texts. However, BLEU depends heavily on the length of the generated text and achieves good scores only if the generated text is short; nonetheless, it remains a popular metric as a pioneer of automatic machine translation evaluation. In addition to comparing word segments with reference texts, METEOR takes synonyms and word stems into consideration for better calibration. CIDEr is another important evaluation metric that adopts term frequency-inverse document frequency: it considers not only the frequency and proportion of specific words in a document, but also the proportion of documents containing those words among all documents. ROUGE is a set of metrics that measure the quality of text summaries and are designed to compare word sequences, word pairs and n-grams with a set of reference summaries.

B. TRAINING DETAILS
Our code is implemented in PyTorch. For preprocessing the COCO captions, we map all words that occur fewer than 4 times to a special UNK token and create a vocabulary from the remaining words. A Faster R-CNN [39] model pre-trained on ImageNet [40] and Visual Genome [41] is employed to extract bottom-up [7] feature vectors of images. For the MSCOCO dataset, the feature matrix of each image is V ∈ R^{k×d}, where k is variable and d is 2048; for the Flickr30k dataset it is similar, with k fixed at 36 and d = 2048. We then project the features to a new dimension of d = 1024, which is also the hidden size of the LSTM in the decoder. As for the training process, the objective is to maximize the sum of the log-likelihoods of the generated words:

L(θ) = Σ_{t=1}^{T} log p(w_t | w_{1:t−1}; θ)

where θ represents the parameters to be learned and w_t is the generated word at time step t. We train our model under cross-entropy loss for 25 epochs with a mini-batch size of 10 using the Adam [42] optimizer.
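The vocabulary preprocessing described above can be sketched as follows (a minimal illustration of the thresholding rule; the actual preprocessing pipeline also handles tokenization and caption truncation):

```python
from collections import Counter

def build_vocab(captions, min_count=4):
    """Map words seen fewer than `min_count` times to a shared UNK token;
    every remaining word gets its own index in the vocabulary."""
    counts = Counter(w for cap in captions for w in cap.split())
    vocab = {"UNK": 0}
    for word, n in counts.items():
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab
```

At decode time, any out-of-vocabulary word in a training caption is then looked up as `vocab.get(word, vocab["UNK"])`.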
We increase the scheduled sampling probability [43] by 0.05 every 5 epochs, and then optimize the CIDEr score with SCST [34] for another 15 epochs to directly optimize the non-differentiable metric. The language model can be regarded as an agent interacting with an environment of images and words: the action is the process of predicting the next word, the state represents the hidden state, and the agent updates its state after each action and receives the reward, the CIDEr score, after emitting the end-of-sentence token. The following formulation comes from SCST [34]:

L(θ) = −E_{y^s ∼ p_θ}[r(y^s)]

where the reward r(·) is the CIDEr score. The gradient can be approximated as:

∇_θ L(θ) ≈ −(r(y^s) − r(ŷ)) ∇_θ log p_θ(y^s)

where y^s is a caption sampled from the probability distribution, while ŷ is the result of greedy decoding.
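The self-critical update amounts to a very small surrogate loss, sketched here in plain Python (the rewards would come from the CIDEr scorer; `log_probs` are the per-word log-probabilities of the sampled caption):

```python
def scst_loss(log_probs, sampled_reward, greedy_reward):
    """SCST policy-gradient surrogate loss (sketch).

    The greedy-decoding reward r(y_hat) serves as the baseline: sampled
    captions scoring above it get a positive advantage (reinforced), and
    those below it a negative advantage (suppressed).
    Returns -(r(y^s) - r(y_hat)) * sum_t log p(w_t^s).
    """
    advantage = sampled_reward - greedy_reward
    return -advantage * sum(log_probs)
```

Minimizing this loss with a standard optimizer reproduces the approximated gradient above, since differentiating it yields exactly −(r(y^s) − r(ŷ)) ∇ log p(y^s).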

C. BASELINE METHODS
To evaluate the effectiveness of our model, we compare it with several strong competitors in terms of four evaluation metrics during the reinforcement training stage, as shown in Figure 5 and Figure 6. For a fair comparison, all competing models are trained under cross-entropy loss for 25 epochs and then for another 15 epochs of reinforcement training. A brief introduction of the baseline methods follows.
FC model: This model was proposed in SCST [34]. The FC model first encodes the input image using a deep CNN and embeds it with a linear projection. Words are fed back into the LSTM, represented as one-hot vectors embedded with a linear embedding whose output dimension matches that of the linear projection in the feature extractor. This method is quite simple compared with other encoder-decoder frameworks.
Show, Attend and Tell model: Rather than using a fixed representation of the image, this attention model [6] dynamically focuses on a specific region of the image at each time step to boost the performance in the deep encoder-decoder framework.
Bottom-Up and Top-Down model: This up-down model consists of bottom-up and top-down attention mechanism, which enables the attention scores to be computed at the levels of salient image regions [7]. The bottom-up attention mechanism determines the image regions to be focused, while the top-down attention mechanism decides the weights of different image regions.
Transformer model: This model [8] consists of components including multi-head self-attention, positional encoding and feed-forward layers. In view of its excellent performance on a number of natural language processing tasks, we compare our modified model to this implementation.
AAT model: This model is proposed in AAT [20] and implements an arbitrary mapping between image regions and words across all attention steps. Traditional attention operations compute weights in a single step, while the AAT model uses a threshold value to adaptively decide the number of attention steps.
AoANet model: This model is proposed in AoANet [9], where the results of self-attention and the initial query are concatenated together and fed into two linear layers to obtain information vectors by multiplication with a sigmoid gate. The final result is used as an alternative to the original self-attention operation.
The comparison suggests that our method performs better than the baseline methods during the reinforcement training process. In particular, our method outperforms AoANet, which is considered one of the best models of recent years.

D. QUANTITATIVE RESULTS
In Table 3, we compare our method with several state-of-the-art methods on the MSCOCO dataset. The NIC [2] method was the first to apply a deep encoder-decoder framework to image captioning. The SCST [34] method introduced reinforcement-learning-based training, which is key to improving the CIDEr score. The Up-Down [7] method proposed a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of salient image regions. The AoANet [9] method introduced an extension of the attention operator in which the final attended information is weighted by a gate. Our work is developed on top of these methods; in particular, it adopts their feature extraction and reinforcement learning. The results in Table 3 show that our method is 15.2% higher than SCST [34] in CIDEr score and 9.1% higher than Up-Down [7]. AoANet has been considered one of the most competitive models of recent years, but its BLEU score is still lower than ours. These results indicate that our model surpasses these methods on some of the evaluation metrics.
We further evaluate our method on the Flickr30k dataset, another popular dataset for early image captioning tasks. Recent research rarely adopts it because Flickr30k is much smaller than MSCOCO. As shown in Table 4, our work slightly outperforms other methods in terms of the METEOR score, but most metrics are lower than those of VS-LSTM [51]; the reason may be that our model focuses less on semantic information than VS-LSTM.
Moreover, we evaluate our model on the official MSCOCO evaluation server, as shown in Table 2; the evaluation is performed on the 40775-image test split and the 40504-image validation split. There are two evaluation settings: c5, where each image has five reference captions, and c40, where each image has forty. Compared with the top-performing methods of the past few years, our model slightly outperforms SGAE [22] on some evaluation metrics such as BLEU-4 [35], METEOR [36] and ROUGE [38], achieving a promising performance.

E. ABLATION STUDY
To quantify the impact of each design choice of our proposed module, we compare our method against variants with some settings removed. As shown in Tables 5 and 6, we compare the effect of memory vectors in the encoder and of different attention modules in the decoder. The ''base'' model has no memory-vector module in its encoder and adopts the original Transformer attention mechanism in its decoder. We can infer that sparse attention improves the scores, and that our proposed Local Adaptive Threshold performs better than the top-k strategy. On the encoder side, the results also verify the efficiency of the memory vectors. Clearly, the experiments on the two datasets demonstrate the superiority of our proposed Local Adaptive Threshold in the decoder. Furthermore, after extensive comparative experiments we find that the optimal values of w and t are 4 and 0.8 for the MSCOCO dataset, and 4 and 0.16 for the Flickr30k dataset.
For a more intuitive display, we show some examples of image captions generated by our model in Figure 7 and Figure 8. As shown in the first and fourth images in Figure 7, the captions include ''a book shelf'' and ''tray'', which the AoANet model does not cover, indicating that our method can generate captions containing more of the objects in an image. Moreover, our model generates more verbs than the AoANet model, e.g. ''blowing out'' in the second image. For the Flickr30k dataset, our method is more likely to generate adjectives like ''pink'' and ''straw'' and more concrete nouns, e.g. ''children'' instead of ''people'', compared with the AoANet model. In a word, our method generates more content-related and grammatically correct sentences.
It is clear that the sentences generated by our method visually describe the basic content of a picture. Compared with our outputs, the human-annotated ground truths contain more detailed information, including rich adjectives describing the objects as well as objects that do not appear in the generated captions. Moreover, our generated sentences typically contain expressions like ''a man'' or ''a woman'', while human-annotated ground truths typically contain phrases such as ''a young girl'', which are more vivid. We observe that the pre-trained image features are underutilized during sentence generation, and the captions cannot produce information as rich as the human-annotated ground truths. Future research could overcome these drawbacks.

V. CONCLUSION
In this article, we introduce a novel model called Local Adaptive Threshold, which can explicitly sparsify the Transformer effectively. Local Adaptive Threshold does not hurt long-term dependency and is determined by the distribution of neighboring nodes. It makes the attention in the original Transformer more concentrated on the most contributive components. We integrate Local Adaptive Threshold into self-attention, where it acts as the attention mechanism in the decoder and helps the model generate more accurate words. Moreover, comprehensive comparisons with state-of-the-art methods and adequate ablation studies demonstrate the effectiveness of our strategies. In the future, we will develop novel frameworks for image captioning, such as unsupervised learning, to further advance our research. The development of advanced image captioning methods is vital for deep image understanding.