Hybrid Attention Distribution and Factorized Embedding Matrix in Image Captioning

Self-attention mechanisms are widely used in current encoder-decoder frameworks for image captioning. These mechanisms use the transformer module as the basic unit: they compute the similarity between query and key vectors and normalize it to obtain an attention distribution matrix, which weights and reconstructs the semantically rich query vectors. However, a single attention distribution matrix cannot express the more complex intrinsic relations between query and key vectors, so some very important intrinsic relations can be lost. We therefore propose a hybrid attention distribution (HAD) that reconstructs multiple distributions to express deeper internal relations, avoiding a single shallow attention distribution. In addition, to reduce the number of model parameters, we factorize the word-embedding matrix. This effectively increases training efficiency without degrading the evaluation metrics. We extensively evaluate our proposals on the MS-COCO image-captioning dataset. Our results outperform existing state-of-the-art methods on some metrics.


I. INTRODUCTION
Image captioning is the task of describing the content of an image in human language. Its challenge lies in bridging two completely different forms of information: image information and natural-language information. Image-captioning technology can be applied to help a blind person visualize the real world, or to help a driver observe emergency situations in the field of autonomous driving. Such benefits strongly motivate further studies of this technology.
Conventional methods of image captioning predominantly adopt the encoder-decoder framework [1]-[4] and use a word-embedding matrix to represent words. Usually, a CNN encoder is used to extract image features, and an RNN decoder decodes one word at each time step based on the image features and the context information from all preceding time steps. In recent years, attention mechanisms based on the encoder-decoder framework [5]-[13] have been introduced to make the image-captioning model focus on a particular region of the image when generating the current word at each time step, and remarkable success has been achieved. However, the attention mechanism involves only a single distribution for calculating the attention weights. In some cases, the attention weight distribution is uniform, i.e., every region contributes similarly to the model. The transformer-based module [14], [15] calculates the similarity between the query and key vectors. Although these vectors are divided into multiple subvectors whose similarities are computed separately, the subvectors are not linked, so the weight distributions within the subvectors are independent. To some extent, this limits the model performance. At the same time, the word-embedding representation constitutes a large proportion of the model parameters. It is therefore reasonable to reduce the number of those parameters without compromising the model performance.
(The associate editor coordinating the review of this manuscript and approving it for publication was Jeon Gwanggil.)
To address these two problems, we propose a hybrid attention distribution (HAD) and a factorized embedding matrix (FEM). The HAD comprehensively considers the weight distributions based on self-attention, computing the weight distribution between the query and key vectors in each subspace. We then perform a weighted summation of the attention distributions across the subspaces to obtain the hybrid attention distribution. This computation is executed in parallel and replaces the original single distribution, relating the image-captioning model to every subspace. The FEM reduces the number of model parameters, and thereby the training time, by decomposing the large vocabulary-embedding matrix into two small matrices: a word is first embedded in a lower-dimensional space, and another matrix then restores the original dimensionality. The advantage of this operation is that it reduces the number of parameters without reducing the model performance.
The main contribution of this work is the proposal of the HAD and FEM modules. The former comprehensively considers multiple different weight distributions in the subspaces to obtain a more expressive hybrid distribution. The aim of the FEM is to reduce the number of parameters in a large word-embedding matrix, making the model converge faster without affecting its performance.

II. RELATED WORK
A. IMAGE CAPTIONING
Traditional approaches to describing images are template-based: they first generate slotted caption templates and fill them using detected objects and attributes [16]-[18]. These methods are relatively sound in terms of linguistic logic, but they have major drawbacks in terms of diversity. More recently, neural methods have successfully performed image captioning by applying the encoder-decoder framework, whereby a CNN is first used to encode an image and an RNN is adopted to decode the output words [1]. Soft and hard attention [10] were introduced subsequently to focus the model on an important area of the image at each time step. Semantic attention [7] was presented to learn to focus selectively on the semantic attributes in an image for sentence generation. An adaptive attention model [9] was proposed to decide whether to attend to image regions at decoding time. Furthermore, bottom-up and top-down attention mechanisms [11] can be exploited to focus on objects rather than image regions. A spatial and channel-wise attention model [12] was proposed, and some RL methods [19], [20] were introduced to train the model.

B. TRANSFORMER-BASED METHOD
Since the transformer [15] was introduced, many methods [21]-[24] based on it have been used in natural language processing, and it has also been applied [14] to image captioning. The transformer structure itself has undergone many improvements [25]-[28]. Owing to the considerable number of parameters in transformer models, some studies [24], [29], [30] have investigated ways to reduce that number while maintaining stable performance. Our work draws on these ideas [24], [28] to investigate how to model the relationships between complex objects in an image-captioning task, and how to effectively reduce the number of model parameters.
We thus propose the HAD and FEM modules as a means to manage complex relationships and to reduce the large number of model parameters. Our experimental results validate our approach, which outperforms the best existing models on some metrics.

III. METHOD
We first introduce the FEM to solve the problem of the large number of model parameters. We then add the HAD, based on the transformer module, and apply it to the encoder and decoder for image captioning.

A. FACTORIZED EMBEDDING MATRIX
In the field of natural language processing, tokens are usually expressed as vectors so that computers can process human language. Conventional approaches [1], [10] use the word-embedding matrix to project tokens onto a fixed number of dimensions. This word embedding can be pretrained (e.g., using GloVe or Word2vec [31], [32]) or randomly initialized as part of the model parameters. In NLP, the size of the vocabulary |V| is much greater than the word-embedding dimension E, so this part of the model can reach millions or even tens of millions of parameters. We solve this problem [24] by first projecting the token onto a lower dimension H using a word-embedding matrix, and then restoring the original embedding dimension E via a projection mapping, as shown in Figure 1 (b). This not only reduces the number of parameters but also increases the relevance between words in the vocabulary. The operation reduces the original complexity O(|V| × E) to O(|V| × H + H × E), which is significant when H ≪ E. The above derivation can be expressed as

w_t = Linear(Emb(Π_t))    (1)

where Π_t, Emb, w_t, and Linear denote, respectively, the one-hot encoding of the input word at time step t, the token-embedding matrix, the word token, and the projection mapping.
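As a rough sanity check of the savings, the factorization replaces one |V| × E matrix with a |V| × H matrix plus an H × E matrix. A minimal sketch in plain Python, using the sizes reported later in this paper (|V| = 9488, E = 1024, H = 128):

```python
# Parameter counts for a standard vs. factorized embedding matrix.
# Sizes taken from the paper: vocab |V| = 9488, embedding E = 1024, bottleneck H = 128.

def embedding_params(vocab_size: int, embed_dim: int) -> int:
    """Parameters of a standard |V| x E embedding matrix."""
    return vocab_size * embed_dim

def factorized_embedding_params(vocab_size: int, embed_dim: int, hidden: int) -> int:
    """Parameters of the factorized form: a |V| x H embedding plus an H x E projection."""
    return vocab_size * hidden + hidden * embed_dim

V, E, H = 9488, 1024, 128
full = embedding_params(V, E)                 # 9,715,712 parameters
fact = factorized_embedding_params(V, E, H)   # 1,345,536 parameters, ~7.2x smaller
```

The reduction only pays off when H ≪ E; as Section C's ablation shows, shrinking H too far starts to hurt the metrics.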

B. HYBRID ATTENTION DISTRIBUTION
The basic transformer module is widely used in NLP tasks. At its core is the self-attention mechanism:

Attention(Q_i, K_i, V_i) = softmax(f_att(Q_i, K_i)) V_i    (2)

where f_att quantifies the similarity between two vectors, and Q_i, K_i, and V_i denote the queries, keys, and values, respectively. While this computation provides a good measure of the correlation between vectors, it is difficult to represent the similarity between vectors that correspond to different modalities, e.g., images and language. Moreover, after the high-dimensional image or language feature vector is divided into several low-dimensional subspaces, these subspaces are independent of each other. To connect them and improve the expressive ability of the model, we propose the HAD based on Eq. 2, as described in Figure 1 (a).
In the HAD, the function f_att superimposes multiple intrinsic inter-vector similarities to express complex intrinsic relationships. This computation requires only very few additional parameters to allow the model to capture complex multimodal information. For given Q_i and K_i, f_att is modeled as:

f_att(Q_i, K_i) = Σ_{j=1}^{h} Λ_{ij} Z_j    (3)

where h is the number of heads and Λ ∈ R^{h×h} contains model parameters that automatically measure the correlation between the heads during training. Z = [Z_1, Z_2, ..., Z_h] quantifies the similarity between vectors in the multiple heads as:

Z_i = Q_i K_i^T / sqrt(d_k)    (4)

where d_k is the subspace dimension. Our experiments revealed that, when using only Eq. 4 to calculate the similarity between vectors, the inherent variance fluctuates greatly, resulting in a very steep softmax that prevents model convergence. Thus, before calculating Z_i, we first normalize Q_i and K_i (e.g., by instance normalization or layer normalization) to ensure numerical stability; for simplicity, this is omitted in Eq. 4. Then we specify the expression for each head_i ∈ R^{L×d_k}, where L is the length of the input tokens:

head_i = softmax(f_att(Q_i, K_i)) V_i    (5)

Finally, the expression for our HAD is obtained as:

HAD(Q, K, V) = concat(head_1, head_2, ..., head_h)    (6)

where the function concat splices the individual heads in the last dimension.
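To make the head-mixing idea of Eqs. 3-6 concrete, the following is a minimal pure-Python sketch (toy sizes, no normalization, no learned Q/K/V projections; the matrices and Λ values are illustrative, not from the paper). Each head's pre-softmax scores become a Λ-weighted sum over all heads' similarity matrices:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def hybrid_attention(Z, Lam, V):
    """Z: list of h similarity matrices (L x L), as in Eq. 4.
    Lam: h x h mixing weights (the Lambda of Eq. 3).
    V: list of h value matrices (L x d_k). Returns the h output heads (Eq. 5)."""
    h, L = len(Z), len(Z[0])
    heads = []
    for i in range(h):
        # Eq. 3: f_att for head i is a Lambda-weighted sum of every head's Z_j.
        f_att = [[sum(Lam[i][j] * Z[j][r][c] for j in range(h))
                  for c in range(L)] for r in range(L)]
        # Eq. 5: row-wise softmax, then weight the values of head i.
        A = [softmax(row) for row in f_att]
        head = [[sum(A[r][c] * V[i][c][d] for c in range(L))
                 for d in range(len(V[i][0]))] for r in range(L)]
        heads.append(head)
    return heads

# Toy example: 2 heads, 2 tokens, d_k = 1.
Z = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
V = [[[1.0], [2.0]], [[3.0], [4.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # Lambda = I recovers standard per-head attention
heads = hybrid_attention(Z, identity, V)
```

With an identity Λ the sketch reduces to ordinary multi-head attention; off-diagonal Λ entries are what let one head's distribution inform another's.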

C. ENCODER WITH HAD
We build our network model according to the classic encoder-decoder structure and embed the HAD within it, as shown in Figure 2. Image captioning involves describing a given image I using a human-language sentence Y_{1:T}. The sentence Y_{1:T} = {w_1, w_2, ..., w_T} consists of a sequence of tokens.
The image I is represented as a set of regional image features V = {v_1, v_2, ..., v_L} extracted by a CNN or Faster R-CNN [33], where v_i ∈ R^D. The output V̂^(i) of the i-th layer depends on the Q, K, V of the (i−1)-th layer:

V̂^(i) = HAD(W_Q V̂^(i−1), W_K V̂^(i−1), W_V V̂^(i−1))    (7)

where W_Q, W_K, W_V are learnable projection matrices. The purpose of stacking such layers in the encoder, as in [15], is to enable the model to express the deep inter-relations between regions of a given image. Notably, Eq. 7 omits the GLU activation [34] module for simplicity.

D. DECODER WITH HAD AND FEM
The role of the decoder in image captioning is to generate matching tokens for the given image-region features V̂^(N), as depicted in Figure 3. The LSTM first generates the current hidden state h_t from its input, which comprises the embedding of the word at the current time step, a visual vector v̄ denoting the global image features, and c_{t−1}, the context information from the previous time step:

h_t, m_t = LSTM([w_t, v̄, c_{t−1}], h_{t−1}, m_{t−1})    (8)

where w_t is derived from Eq. 1 and h_t, m_t ∈ R^D.

FIGURE 3. Overview of our decoder for image captioning. We encapsulate Eq. 1 with the FEM module and use the hidden state generated by the LSTM and the output V̂^(N) from the last encoder layer as input to the HAD module, acting as Q, K, and V, respectively, to obtain v̂_t. Finally, after the GLU layer, the linear layer, and the softmax layer, we obtain the word for the next step, w_t.

VOLUME 8, 2020

From the structural diagram in Figure 3, c_t is obtained from the GLU and HAD modules:

v̂_t = HAD(W_Q h_t, W_K V̂^(N), W_V V̂^(N))    (9)
c_t = GLU([h_t, v̂_t])    (10)

where the [·, ·] notation indicates splicing in the last dimension and W_Q, W_K, W_V ∈ R^{D×D} are parameters to be learnt. Once c_t is obtained, the conditional probability over the vocabulary can be calculated to obtain the word at the next time step:

p(w_t | Y_{1:t−1}, I) = softmax(W_o c_t)    (11)

where W_o ∈ R^{|V|×D} contains the learning parameters of the model.
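The GLU gate used above splits its input in half and uses one half to gate the other, halving the dimensionality. A minimal sketch of that gating (omitting the learned linear layer that normally precedes it in [34]):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def glu(x):
    """Gated Linear Unit on a flat vector: split x into halves (a, b)
    and return a * sigmoid(b). The output has half the input's length."""
    n = len(x) // 2
    a, b = x[:n], x[n:]
    return [a_i * sigmoid(b_i) for a_i, b_i in zip(a, b)]

out = glu([1.0, 2.0, 0.0, 0.0])  # gate sigmoid(0) = 0.5 -> [0.5, 1.0]
```

In the decoder, the spliced vector [h_t, v̂_t] has dimension 2D, so the GLU returns a D-dimensional c_t, matching the linear layer W_o ∈ R^{|V|×D} that follows.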

E. TRAINING STRATEGY
According to the training strategy of most image-captioning models [11], [14], [20], the model is trained using two optimization methods.

Training with cross-entropy loss. We first train our model using the cross-entropy (XE) loss:

L_XE(θ) = − Σ_{t=1}^{T} log p_θ(y_t | Y_{1:t−1})    (12)

where Y_{1:t−1} denotes the target ground-truth sequence.

Training with RL. After training with XE is complete, we use the resulting parameters to initialize the RL training strategy, following Self-Critical Sequence Training (SCST):

L_RL(θ) = − E_{Y_{1:T} ∼ p_θ} [ r(Y_{1:T}) ]    (13)

where the reward r(·) uses the score of some metric (e.g., CIDEr-D). The gradients can be approximated as:

∇_θ L_RL(θ) ≈ −(r(Y^s_{1:T}) − r(Ŷ_{1:T})) ∇_θ log p_θ(Y^s_{1:T})    (14)

where Y^s_{1:T} is a sampled caption and Ŷ_{1:T} is the greedily decoded baseline caption.
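The SCST update can be sketched numerically: the gradient weight for a sampled caption is its reward minus the greedy baseline's reward, so samples that beat the baseline have their probability pushed up and samples below it are suppressed. A toy illustration (the reward values are made-up stand-ins for CIDEr-D scores, not results from the paper):

```python
def scst_loss(logprob_sampled: float, reward_sampled: float, reward_greedy: float) -> float:
    """Surrogate loss whose gradient w.r.t. logprob_sampled is -(advantage),
    matching the SCST approximation -(r(sampled) - r(greedy)) * grad log p."""
    advantage = reward_sampled - reward_greedy
    return -advantage * logprob_sampled

# With a positive advantage, minimizing the surrogate increases the sampled
# caption's log-probability; with a negative advantage, it decreases it.
good = scst_loss(logprob_sampled=-5.0, reward_sampled=1.2, reward_greedy=1.0)
bad = scst_loss(logprob_sampled=-5.0, reward_sampled=0.8, reward_greedy=1.0)
```

Using the model's own greedy decode as the baseline needs no learned value function, which is why SCST is a common second training stage for caption models.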

IV. EXPERIMENTS
A. DATASET AND IMPLEMENTATION DETAILS
All experiments are conducted on the most popular image-captioning benchmark, MS COCO [37], which contains 123,287 images, each labeled with at least 5 captions by 5 different people. Following many previous works [11], [14], [20], we use the standard Karpathy split for offline performance comparisons, in which 5,000 images are used for validation, 5,000 for testing, and 113,287 for training. MS COCO also provides 40,775 images as a test set for online evaluation. We convert all sentences to lower case and drop words that occur fewer than 6 times, yielding a final vocabulary of 9,488 unique words. For evaluation, BLEU [38], METEOR [39], ROUGE-L [40], CIDEr [41], and SPICE [42] are used. We employ a Faster R-CNN [33] model pre-trained on ImageNet and Visual Genome [43], [44] to extract bottom-up feature vectors [11] from images. We take the output of the RoI pooling layer and perform non-maximum suppression for each object class using an IoU threshold. We then select all regions where any class-detection probability exceeds a confidence threshold, such that each image has between 10 and 100 regions. The dimension of the original vectors is 2048, and we project them into a new space of dimension D = 1024, which is also the hidden size of the LSTM in the decoder. We set H = 128 and E = 1024 in our strong model. For training, we first train our model under the XE loss for 30 epochs with a mini-batch size of 10, using the ADAM optimizer with a learning rate initialized to 2e-4 and annealed by a factor of 0.8 every 3 epochs. We increase the scheduled-sampling [45] probability by 0.05 every 5 epochs. We then optimize the CIDEr-D score with SCST for another 15 epochs, with an initial learning rate of 2e-5 annealed by 0.5 whenever the score on the validation split does not improve for some number of training steps.
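The XE-stage annealing schedule (start at 2e-4, multiply by 0.8 every 3 epochs) and the scheduled-sampling ramp can be sketched as small helpers. The starting probability of 0.0 and the cap of 1.0 in the second function are our assumptions; the paper only states the increment:

```python
def xe_learning_rate(epoch: int, base_lr: float = 2e-4,
                     decay: float = 0.8, every: int = 3) -> float:
    """Learning rate for the XE stage: annealed by `decay` every `every` epochs."""
    return base_lr * decay ** (epoch // every)

def scheduled_sampling_prob(epoch: int, step: float = 0.05,
                            every: int = 5, cap: float = 1.0) -> float:
    """Scheduled-sampling probability, raised by `step` every `every` epochs.
    The 0.0 start and 1.0 cap are assumptions, not stated in the paper."""
    return min(cap, step * (epoch // every))

rates = [xe_learning_rate(e) for e in range(30)]  # 30 XE epochs, as in the paper
```

An equivalent PyTorch setup would typically use `torch.optim.lr_scheduler.StepLR` with `step_size=3, gamma=0.8`.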

B. QUANTITATIVE ANALYSIS
We report the performance of our model and compare it with the current state of the art in Tables 1 and 2. These models include NIC [1], which uses a vanilla CNN-LSTM encoder-decoder framework but lacks deep feature extraction and fusion; SCST [20], which applies a standard attention-based model and was the first to use SCST to directly optimize non-differentiable metrics, but is also insufficient for multimodal information fusion; Up-Down [11], which uses an attention LSTM to attend over image features extracted from a Faster R-CNN model but lacks further extraction of image-region features; RFNet [35], which fuses encoded features from multiple CNN networks; GCN-LSTM [36], which predicts visual relationships between every two entities in the image and encodes the relationship information into feature vectors; and finally AoANet [14], which uses an information vector and an attention gate to obtain attended information based on a transformer language model. However, in the internal calculation of the transformer, the relations between the sub-heads are not considered.
For the cross-entropy training stage in Table 1, we first decompose the word-embedding matrix on the basis of [14], keeping the rest of the model unchanged. Except for a slight decrease in CIDEr, the metrics remained the same or slightly improved while the number of model parameters was reduced. These results demonstrate the effectiveness of the FEM at reducing the model size while maintaining stable performance.
We then use our full model (HAD+FEM) to validate the performance of the HAD. Our single model achieves higher scores than AoANet+FEM for all metrics, with the exception of METEOR. Our model is also higher than [14] on the BLEU1∼4 and SPICE indicators, although its METEOR and CIDEr scores were marginally lower than [14].
For the sequence-level optimization stage in Table 2, our model also achieves the highest scores for BLEU1∼BLEU4 and SPICE. Because of the reduced size of our model (from 85.6M to 77.2M parameters), its performance is slightly lower for METEOR, ROUGE-L, and CIDEr, but nonetheless approaches the highest scores.
We also evaluate our single model on the online COCO test server in Table 3. The results show that our single-model performance is close to [35] and [36], with slight improvements on the ROUGE and METEOR indicators.

C. QUALITATIVE ANALYSIS
Figure 5 shows some results generated by our HAD model, alongside a strong baseline and the human-annotated ground truth. We derive the baseline model by re-implementing [14]. These examples suggest that the baseline is syntactically logical, but has problems describing specific objects. In contrast: (1) our model correctly describes the specific objects in the image. For example, in the first example, the baseline describes a couch as a table, and in the second example (from left to right), the baseline mistakes the hot dog for a sandwich, whereas our model yields accurate descriptions. (2) Our model correctly describes the number of specific objects in the image. In the third and fourth examples, our model accurately recognizes that there are three cupcakes, and a black and white cat looking out the window, while the baseline model depicts them as a group of cupcakes and two black and white cats. Our model also focuses on the single wooden bench in the fifth example, rather than two wooden benches. These advantages stem precisely from the HAD module, which is able to focus on multiple layers of specific complex objects in an image, understand their relationships, and accurately generate the corresponding words in the decoding phase. Thus, our model performs better than the baseline in terms of accuracy and fluency.

In the sixth example, however, our model depicts a truck parked in the forest and does not represent the snow scene, while the baseline model describes the snow but fails to describe the forest. This example highlights the remaining challenges of image captioning and the scope for future research.

Table 4 reports the effect of different values of H in the FEM module on the model performance. The best performance is achieved at H = 128, which attains the highest value for each metric. As H decreases to 32, the indicators drop significantly, but the BLEU1, METEOR, and SPICE values remain greater than at H = 16 and H = 8. We perform this analysis because, for smaller values of H, the number of model parameters changes little, resulting in little change in the model's performance. In the extreme case H = 1, the model performance is better than at H = 8, 16, and 32 for some metrics.

Figure 6 also analyses the model's convergence rate during training for different values of H, compared with the baseline model [14]. Figure 6 clearly shows that the model converges much faster than the baseline on BLEU1∼BLEU4, with higher values of H achieving faster convergence. For example, our model converges after 200k iterations, while the baseline converges after approximately 300k iterations. This advantage is less pronounced for the CIDEr and SPICE metrics, but is nonetheless apparent relative to the baseline. We can thus conclude that our model converges faster than the baseline model.

In Figure 4, we observe that the attention map in the HAD is lighter than the baseline's and focuses on different positions at each time step, which indicates that our model is able to handle the intrinsic relationships of complex objects and generate the corresponding correct word, e.g., couch instead of table.

V. CONCLUSION
We have proposed a method that uses a hybrid attention distribution and a factorized embedding matrix to address the two problems identified above. The former solves the problem wherein a single distribution cannot effectively express the complex interconnections between multimodal states. The latter effectively reduces the number of model parameters and makes the model converge faster. Experiments on the MS COCO dataset demonstrate that our approach achieves state-of-the-art performance on some image-captioning metrics.